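I have recently been playing with the RISC-V V (vector) extension, and the vtype register is really at the heart of it.
These notes record what I learned and concluded about the vtype register.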
Background
With a vector length specific (VLS) SIMD instruction set the main problem is to pick the right vector register size. Of course there is a trade-off between the amount of data-level parallelism and hardware costs. Due to Moore's law, vector register sizes can be increased over time without making the CPU chip more expensive. Also, some users are interested in powerful CPUs with wider vector registers while the average user is fine with average-sized registers. Thus, there is no one right vector register size.
The solution to all this is to design a variable length vector instruction set. In that way the instructions are then agnostic to the vector register size of a concrete CPU implementation. Thus, the binary code is portable between low, middle and high-end CPUs, and automatically makes use of wider registers in newer CPUs.
Personally, I feel that RISC-V, as a latecomer, wins by not having to care about historical baggage: it could easily start from scratch.
Design Challenges: Opcode Space
Community wanted to stay with 32-bit instruction encoding
- Low-end embedded systems have 32-bit instruction fetch
- Harder to support mixed-length instruction streams, 16,32,48, & 64 bits
- Static code size matters on embedded platforms
But wanted vast array of datatypes and custom datatypes, and large set of operations
Opcode Solution
There are two registers used when operating vectors in RVV
The vtype register: Vector Type.
Added a control register to hold some information about current setting of vector unit
vtype describes the type of vector we are going to operate on and includes the following fields (6-7 additional state bits in total):
- vsew: standard element width (SEW=8,16,32,…,1024)
  - Size in bits of the elements being operated on
  - 8 ≤ SEW ≤ ELEN
- vlmul: vector length multiplier
  - \(LMUL = 2^k\) where -3 ≤ k ≤ 3 (i.e., LMUL ∈ {1/8, 1/4, 1/2, 1, 2, 4, 8})
  - Register grouping: with LMUL=1,2,4,8, registers are grouped to form a "longer vector"
  - The number of registers in each group is LMUL
  - Example: with LMUL=2, vadd.vv v2, v4, v6 means (v2,v3) := (v4,v5) + (v6,v7)
- vediv: vector element divider (EDIV=1,2,4,8)
Encoding only occupies 1.5 major opcodes
Full 64-bit instruction encoding also planned - Can view current 32-bit encoding as compressed form of full encoding
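To make the vtype fields described above concrete, here is a minimal C sketch of how the field values map to SEW, LMUL and a register group. The helper names (`sew_bits`, `lmul_fraction`, `group_registers`) are made up for illustration, and the sketch assumes the encoding SEW = 8 · 2^vsew, which matches the SEW=8,16,…,1024 range listed above.

```c
/* Hypothetical helper: element width in bits for a vsew field value.
 * Assumes SEW = 8 * 2^vsew, so vsew = 0..7 gives 8..1024 bits. */
static unsigned sew_bits(unsigned vsew) {
    return 8u << vsew;
}

/* Hypothetical helper: LMUL = 2^k as a fraction num/den, for k in [-3, 3]. */
static void lmul_fraction(int k, unsigned *num, unsigned *den) {
    *num = (k >= 0) ? (1u << k) : 1u;
    *den = (k >= 0) ? 1u : (1u << -k);
}

/* Hypothetical helper: for integer LMUL (1, 2, 4, 8), list the registers
 * that form the group starting at `base`. Returns the group size. */
static unsigned group_registers(unsigned base, unsigned lmul, unsigned out[8]) {
    unsigned count = lmul < 8u ? lmul : 8u;
    for (unsigned i = 0; i < count; i++) {
        out[i] = base + i;  /* e.g. base = 4, LMUL = 2 gives v4, v5 */
    }
    return count;
}
```

For example, `group_registers(4, 2, regs)` fills `regs` with 4 and 5, which corresponds to the (v4,v5) group in the LMUL=2 example.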
The vl register: Vector Length (not to be confused with VLEN!)
vl describes how many elements of the vector (starting from element zero) we are going to operate on
\[0 \le vl \le VLMAX(SEW, LMUL)\]
\[VLMAX(SEW, LMUL) = (VLEN / SEW) \times LMUL\]
Vector length \(vl\) is set to min(AVL, VLMAX)
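The two relations above can be written down as a small C sketch. The function names `vlmax` and `set_vl` are hypothetical, LMUL is passed as the exponent k (LMUL = 2^k), and the sketch only models the vl = min(AVL, VLMAX) rule stated here.

```c
#include <stddef.h>

/* VLMAX = (VLEN / SEW) * LMUL, with LMUL = 2^k and k in [-3, 3]. */
static size_t vlmax(size_t vlen_bits, size_t sew_bits, int k) {
    size_t per_register = vlen_bits / sew_bits;  /* elements in one register */
    return (k >= 0) ? (per_register << k) : (per_register >> -k);
}

/* vl = min(AVL, VLMAX) */
static size_t set_vl(size_t avl, size_t vlen_bits, size_t sew_bits, int k) {
    size_t max = vlmax(vlen_bits, sew_bits, k);
    return avl < max ? avl : max;
}
```

With VLEN=128 and SEW=32, `vlmax` gives 4 elements for LMUL=1 and 2 elements for LMUL=1/2; asking for AVL=10 at LMUL=1 yields vl=4.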
RVV defines 32 vector registers of size VLEN bits
- v0 to v31
- VLEN is a constant parameter chosen by the implementor and must be a power of two
- Zv* standard extensions constrain VLEN to be at least 64 or 128
- E.g., VLEN=512 would be equivalent in size to Intel AVX-512
- VLEN is not a great name, so read it as "vector register size (in bits)"
Vectors in RVV are divided into elements.
- Element sizes range from 8 bits up to ELEN bits
- ELEN is a constant parameter chosen by the implementor
- Must be a power of two and 8 ≤ ELEN ≤ VLEN
- Zv* standard extensions constrain ELEN to be at least 32 or 64
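The constraints on VLEN and ELEN listed above can be checked mechanically. The following is a minimal sketch with hypothetical function names. It encodes only the rules stated in these notes (both are powers of two, and 8 ≤ ELEN ≤ VLEN), not the extra minimums imposed by individual Zv* extensions.

```c
#include <stdbool.h>

/* True if x is a non-zero power of two. */
static bool is_pow2(unsigned x) {
    return x != 0 && (x & (x - 1)) == 0;
}

/* Hypothetical check of an implementation's (VLEN, ELEN) choice. */
static bool valid_vector_params(unsigned vlen_bits, unsigned elen_bits) {
    return is_pow2(vlen_bits) && is_pow2(elen_bits) &&
           elen_bits >= 8 && elen_bits <= vlen_bits;
}
```

For instance, (VLEN=128, ELEN=64) passes, while (VLEN=96, ELEN=32) fails because 96 is not a power of two.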
Vector Length control
With LMUL=1, the current maximum vector length is the register length in bits divided by the current element width setting:
\[VLMAX = VLEN/SEW\]
- E.g., VLEN = 512b, SEW = 32b => VLMAX = 16
The current active vector length is set by the vl register:
\[0 \le vl \le VLMAX\]
vsetvli/vsetvl instructions
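The vsetvli instruction sets both the vector configuration and the vector length:
vsetvli rd,rs1,vtypei(e8/e16/e32/e64),multi-num(m1-m8)
rd - Returns the resulting vector length setting in a scalar register, i.e. the number of elements the hardware will operate on in the instructions that follow
rs1 - Application vector length (AVL): the number of elements the application wants to process
vtypei - The bit width of each element
multi-num - The number of registers grouped together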
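When writing rv64 code, if the value returned in rd differs from the previous setting, the hardware has to reconfigure itself. This step costs a lot of run time, so try to change the rd value as rarely as possible.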
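On rv64 each vector register is at least 128 bits wide. Taking 128 bits as an example, the value returned in rd follows from a simple formula (VLMAX = 128 / vtypei * multi-num):
rd = (rs1 <= 128 / vtypei * multi-num) ? rs1 : (128 / vtypei * multi-num)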
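If one piece of code contains two such instructions with different settings, the hardware gets reconfigured, which hurts the overall performance of the code.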
Simple memcpy example
# void *memcpy(void *dest, const void *src, size_t n)
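Only the signature of the listing is shown above. As an independent illustration (not the original listing), the strip-mining pattern behind a vector memcpy can be modelled in plain C: each loop iteration asks for the remaining byte count as AVL, gets back vl = min(AVL, VLMAX), copies vl bytes, and advances. The names `MODEL_VLEN_BITS`, `model_vsetvli_e8_m8` and `model_memcpy` are hypothetical, and VLEN=128 is an assumption. On real hardware the inner copy would be one vector load and one vector store.

```c
#include <stddef.h>

#define MODEL_VLEN_BITS 128u  /* assumed VLEN; real hardware may be wider */

/* Models `vsetvli rd, rs1, e8, m8`: SEW = 8 bits, LMUL = 8. */
static size_t model_vsetvli_e8_m8(size_t avl) {
    size_t vlmax = (MODEL_VLEN_BITS / 8u) * 8u;  /* (VLEN / SEW) * LMUL */
    return avl < vlmax ? avl : vlmax;
}

/* Strip-mined copy: processes at most VLMAX bytes per iteration. */
static void *model_memcpy(void *dest, const void *src, size_t n) {
    unsigned char *d = dest;
    const unsigned char *s = src;
    while (n > 0) {
        size_t vl = model_vsetvli_e8_m8(n);  /* elements for this pass */
        for (size_t i = 0; i < vl; i++) {
            d[i] = s[i];                     /* stands in for vector load/store */
        }
        d += vl;                             /* bump pointers */
        s += vl;
        n -= vl;                             /* decrement remaining count */
    }
    return dest;
}
```

Note that vl only changes on the final pass (the tail), so the same loop body works unchanged for any VLEN.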
usability code
By querying vlenb through an intrinsic-style helper, the code can stay compatible with devices of different vector widths.
static inline int csrr_vl()
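As a hedged sketch of the idea (not the original helper, whose body is not shown), vlenb can be read with inline assembly when the compiler targets the V extension. The names `read_vlenb` and `elems_per_register` are hypothetical, and the fallback value of 16 bytes (VLEN=128) on other targets is an assumption that exists only so the sketch builds everywhere.

```c
#include <stddef.h>

/* Returns the vector register size in bytes (VLEN / 8). */
static inline size_t read_vlenb(void) {
#if defined(__riscv_vector)
    size_t vlenb;
    __asm__ volatile("csrr %0, vlenb" : "=r"(vlenb));  /* read the vlenb CSR */
    return vlenb;
#else
    return 16;  /* assumed fallback: pretend VLEN = 128 on non-RVV targets */
#endif
}

/* Elements of `sew_bits` width that fit in one register (LMUL = 1). */
static inline size_t elems_per_register(size_t sew_bits) {
    return read_vlenb() * 8 / sew_bits;
}
```

Code that sizes its buffers or loop steps from `read_vlenb()` instead of a hard-coded width keeps working when it runs on a core with wider vector registers.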
These notes are rather messy; I do not have time to tidy them up properly for now.