vtype_register

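I have been playing with the RISC-V V extension lately, and the vtype register is arguably its soul.

These are my notes on what I learned about the vtype register and the conclusions I drew.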

Background

With a vector length specific (VLS) SIMD instruction set, the main problem is picking the right vector register size. There is a trade-off between the amount of data-level parallelism and hardware cost. Thanks to Moore's law, vector register sizes can be increased over time without making the CPU chip more expensive. Also, some users want powerful CPUs with wider vector registers, while the average user is fine with average-sized registers. Thus, there is no single right vector register size.

The solution is to design a variable length vector instruction set. That way the instructions are agnostic to the vector register size of a concrete CPU implementation. The binary code is therefore portable between low-end, mid-range and high-end CPUs, and automatically makes use of wider registers in newer CPUs.

Personally, I think RISC-V's advantage as a latecomer is that it carries no legacy baggage and can easily start from a clean slate.

Design Challenges: Opcode Space

The community wanted to stay with a 32-bit instruction encoding:

- Low-end embedded systems have a 32-bit instruction fetch
- Mixed-length instruction streams (16, 32, 48, and 64 bits) are harder to support
- Static code size matters on embedded platforms

But it also wanted a vast array of datatypes, custom datatypes, and a large set of operations.

Opcode Solution

Two registers are used when operating on vectors in RVV.

The vtype register: Vector Type.

A control register was added to hold the current settings of the vector unit.

vtype describes the type of vector we are going to operate on, and includes the following fields.

vtype fields (6-7 additional state bits in total):

- vsew: standard element width (SEW=8,16,32,64 in v1.0; draft versions encoded widths up to 1024)
  - Size in bits of the elements being operated on
  - 8 ≤ sew ≤ ELEN
- vlmul: vector length multiplier
  - $lmul = 2^k$ where -3 ≤ k ≤ 3 (i.e., lmul ∈ {1/8, 1/4, 1/2, 1, 2, 4, 8})
  - Register grouping: groups registers to form a "longer vector"
  - For LMUL ≥ 1, the number of registers in each group is LMUL
  - Example: with LMUL=2, vadd.vv v2, v4, v6 means (v2,v3) := (v4,v5) + (v6,v7)
- vediv: vector element divider (EDIV=1,2,4,8); this field comes from draft versions of the spec and is not part of the ratified v1.0 vtype
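To make the field list more concrete, here is a minimal sketch of decoding a raw vtype value in C. It assumes the bit layout of the ratified v1.0 spec (vlmul in bits 2:0, vsew in bits 5:3, vta in bit 6, vma in bit 7), which differs from the draft layout that still had vediv. The struct and the function name `decode_vtype` are my own, not part of any library.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of a vtype value, assuming the v1.0 bit layout. */
typedef struct {
    unsigned sew_bits;  /* element width in bits (8, 16, 32, 64) */
    int      lmul_log2; /* k in LMUL = 2^k, from -3 to 3 */
    bool     vta;       /* tail-agnostic bit */
    bool     vma;       /* mask-agnostic bit */
    bool     reserved;  /* true if vsew or vlmul uses a reserved encoding */
} vtype_fields;

vtype_fields decode_vtype(uint64_t vtype)
{
    vtype_fields f;
    unsigned vlmul = (unsigned)(vtype & 0x7);        /* bits 2:0 */
    unsigned vsew  = (unsigned)((vtype >> 3) & 0x7); /* bits 5:3 */

    /* vsew 0..3 encode SEW = 8 << vsew; larger values are reserved. */
    f.sew_bits = (vsew <= 3) ? (8u << vsew) : 0;

    /* vlmul is a 3-bit signed exponent: 0..3 mean LMUL 1..8,
       5..7 mean 1/8, 1/4, 1/2, and 4 is reserved. */
    f.lmul_log2 = (vlmul < 4) ? (int)vlmul : (int)vlmul - 8;

    f.vta = ((vtype >> 6) & 1) != 0;
    f.vma = ((vtype >> 7) & 1) != 0;
    f.reserved = (vsew > 3) || (vlmul == 4);
    return f;
}
```

For instance, the value 0x11 decodes to SEW=32 and LMUL=2, and 0x0F decodes to SEW=16 and LMUL=1/2.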

Encoding only occupies 1.5 major opcodes

Full 64-bit instruction encoding also planned - Can view current 32-bit encoding as compressed form of full encoding

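The vl register: Vector Length (not to be confused with VLEN!)

vl describes how many elements of the vector (starting from element zero) we are going to operate on.

$0 \le vl \le \mathrm{vlmax}(sew, lmul)$
$\mathrm{vlmax}(sew, lmul) = (VLEN / sew) \times lmul$

The vector length $vl$ is set to min(AVL, VLMAX). (When VLMAX < AVL < 2 × VLMAX, the v1.0 spec also lets an implementation pick a smaller vl, but no less than ceil(AVL / 2).)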
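The vlmax formula is easy to check with a few lines of C. This is only a sketch: the function name `vlmax` and its parameters are my own, LMUL is passed as the exponent k of $2^k$ so that fractional values can be expressed, and the spec's extra restrictions on which SEW/LMUL combinations are legal are not modeled.

```c
#include <assert.h>

/* VLMAX = (VLEN / SEW) * LMUL, with LMUL = 2^lmul_log2 and -3 <= lmul_log2 <= 3. */
unsigned vlmax(unsigned vlen_bits, unsigned sew_bits, int lmul_log2)
{
    assert(sew_bits >= 8 && vlen_bits >= sew_bits);
    assert(lmul_log2 >= -3 && lmul_log2 <= 3);

    unsigned per_register = vlen_bits / sew_bits; /* elements in one register */

    /* Multiply for grouped registers, divide for fractional LMUL. */
    return (lmul_log2 >= 0) ? per_register << lmul_log2
                            : per_register >> -lmul_log2;
}
```

With VLEN=512 and SEW=32, `vlmax(512, 32, 0)` gives 16 elements for LMUL=1 and `vlmax(512, 32, 3)` gives 128 for LMUL=8.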

RVV defines 32 vector registers, each VLEN bits wide:

- v0 to v31
- VLEN is a constant parameter chosen by the implementor and must be a power of two
- Zv* standard extensions constrain VLEN to be at least 64 or 128
- E.g., VLEN=512 would be equivalent in size to Intel AVX-512
- VLEN is not a great name, so read it as "vector register size (in bits)"

Vectors in RVV are divided into elements:

- The element size ranges from 8 bits up to ELEN bits
- ELEN is a constant parameter chosen by the implementor
- ELEN must be a power of two and 8 ≤ ELEN ≤ VLEN
- Zv* standard extensions constrain ELEN to be at least 32 or 64

Vector Length control

The current maximum vector length is the register length in bits divided by the current element width setting. With LMUL=1 (no register grouping) this is:

\[VLMAX = VLEN/SEW\]

E.g., VLEN = 512b, SEW = 32b => VLMAX = 16

The current active vector length is set by the vl register:

\[0 \le vl \le VLMAX\]

vsetvli/vsetvl instructions

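The vsetvli instruction sets both the vector configuration and the vector length:

vsetvli rd, rs1, vtypei (e8/e16/e32/e64), multi-num (m1/m2/m4/m8)

rd - Returns the vector length setting in a scalar register. This is the number of elements the following vector instructions will operate on.

rs1 - Application vector length (AVL): the number of elements the application wants to process.

vtypei - The width in bits of each element.

multi-num - The number of registers grouped together.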

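When writing RV64 code, if successive vsetvli instructions produce different settings (and so, typically, a different value in rd), the hardware has to reconfigure the vector unit. This step takes up a lot of run time, so try to keep these changes to a minimum when writing code.

With the standard V extension on RV64, each vector register is at least 128 bits wide. Taking 128 as an example, the value of rd follows a simple formula:

rd = rs1 <= (128 / vtypei * multi-num) ? rs1 : (128 / vtypei * multi-num)

For example:

vsetvli rd, rs1 == 8, vtypei == e8, multi-num == m1
returns rd == 8 (128 / 8 == 16 >= rs1, so rd == rs1 == 8)

vsetvli rd, rs1 == 32, vtypei == e16, multi-num == m1
returns rd == 8 (128 / 16 == 8 < rs1, so rd == 8)

If both instructions appear in the same piece of code, the element width changes from e8 to e16 between them, so the hardware is reconfigured, which hurts the overall performance of the code.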
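The value written to rd can be modeled in plain C to double-check examples like the two above. This is a hypothetical model for integer LMUL only, using the simple min(AVL, VLMAX) rule; `vsetvli_model` and its parameters are names I made up for illustration.

```c
/* Model of the vl that vsetvli returns in rd, for integer LMUL (1, 2, 4, 8).
   avl is the requested element count (rs1), sew_bits the element width. */
unsigned vsetvli_model(unsigned avl, unsigned vlen_bits,
                       unsigned sew_bits, unsigned lmul)
{
    unsigned vlmax = vlen_bits / sew_bits * lmul;

    /* Grant the full request if it fits, otherwise clamp to VLMAX. */
    return (avl <= vlmax) ? avl : vlmax;
}
```

With VLEN=128, `vsetvli_model(8, 128, 8, 1)` gives 8 because VLMAX is 16, and `vsetvli_model(32, 128, 16, 1)` also gives 8 because VLMAX is 128 / 16 = 8.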

Simple memcpy example

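# void *memcpy(void *dest, const void *src, size_t n)
# a0=dest, a1=src, a2=n
# Mnemonics follow the early draft spec used in the slides. In v1.0 the
# byte load/store are vle8.v/vse8.v, and vsetvli is normally written with
# an explicit LMUL and tail/mask policy (e.g. e8, m8, ta, ma).
memcpy:
    mv a3, a0              # Copy destination
loop:
    vsetvli t0, a2, e8     # Vectors of 8b; t0 = bytes handled this pass
    vlb.v v0, (a1)         # Load bytes
    add a1, a1, t0         # Bump source pointer
    sub a2, a2, t0         # Decrement count
    vsb.v v0, (a3)         # Store bytes
    add a3, a3, t0         # Bump destination pointer
    bnez a2, loop          # Any more?
    ret                    # Return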
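The loop above is a strip-mining loop: each pass handles as many bytes as vsetvli grants. A scalar C sketch of the same control flow may make that easier to see. It is only a model: the function `stripmine_memcpy` and its `vlmax` parameter (the number of bytes one pass can handle, which must be non-zero) are my own, and the C library memcpy stands in for the vector load/store pair.

```c
#include <stddef.h>
#include <string.h>

/* Copies n bytes in chunks of at most vlmax bytes, mirroring the assembly loop.
   Returns the number of loop iterations executed. */
size_t stripmine_memcpy(void *dest, const void *src, size_t n, size_t vlmax)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    size_t iterations = 0;

    do {
        size_t vl = (n < vlmax) ? n : vlmax; /* vsetvli t0, a2, e8 */
        memcpy(d, s, vl);                    /* vector load, then vector store */
        s += vl;                             /* bump source pointer */
        d += vl;                             /* bump destination pointer */
        n -= vl;                             /* decrement count */
        iterations++;
    } while (n != 0);                        /* bnez a2, loop */

    return iterations;
}
```

Copying 40 bytes with `vlmax` set to 16 takes three passes (16, 16 and 8 bytes). Like the assembly, the body runs once even when n is 0.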

Utility code

Querying vlenb (here through small inline-assembly helpers) makes it possible to support devices with different vector register widths.

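/* CSRs are XLEN bits wide, so the helpers return unsigned long rather than
   int; on RV64 an int would drop the upper bits, including vill in vtype. */

/* Read vl: the number of elements currently active. */
static inline unsigned long csrr_vl(void)
{
    unsigned long a;
    asm volatile("csrr %0, vl" : "=r"(a) : : "memory");
    return a;
}

/* Read vtype: the current vector configuration (SEW, LMUL, ...). */
static inline unsigned long csrr_vtype(void)
{
    unsigned long a;
    asm volatile("csrr %0, vtype" : "=r"(a) : : "memory");
    return a;
}

/* Read vlenb: the vector register length in bytes (VLEN / 8). */
static inline unsigned long csrr_vlenb(void)
{
    unsigned long a;
    asm volatile("csrr %0, vlenb" : "=r"(a) : : "memory");
    return a;
}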
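Since vlenb holds VLEN in bytes, the value read from it is enough to work out how many elements fit in one register on the current device. A minimal sketch, with the vlenb value passed in as a plain argument so it can be tried on any machine; `elems_per_register` is a helper name I made up.

```c
/* Number of sew_bits-wide elements that fit in one vector register,
   given the value read from the vlenb CSR (VLEN in bytes). */
unsigned long elems_per_register(unsigned long vlenb, unsigned sew_bits)
{
    unsigned long vlen_bits = vlenb * 8; /* bytes to bits */
    return vlen_bits / sew_bits;
}
```

On a device where vlenb reads 16 (VLEN=128), `elems_per_register(16, 32)` is 4, while on a VLEN=512 device (vlenb = 64) the same call gives 16.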

These notes are still pretty messy; I don't have time to tidy them up properly for now.

References

Programming with RISC-V Vector Instructions

RISC-V 小白优化学习

Vector Extension Update

Adventures with RISC-V Vectors and LLVM