This project records the process of optimizing SGEMM (single-precision General Matrix Multiplication) on the RISC-V platform.
General Matrix Multiplication (GEMM) is one of the core computing units in deep learning frameworks, widely used to implement operators such as Convolution, Fully Connected, and Matmul.
I conducted experiments and explorations on the Allwinner Nezha D1 development board. Versions 0 through 5 were implemented in C, while versions 6 through 9 partially use assembly language, involving the RISC-V V extension instructions.
Note: Unlike some other GEMM optimization projects, this project uses row-major order matrices exclusively. Because I prefer row-major order!
Prerequisite Knowledge
RISC-V is an open standard Instruction Set Architecture (ISA) enabling a new era of processor innovation through open collaboration.
GEMM: General Matrix Multiply, one of the Basic Linear Algebra Subprograms (BLAS).
FLOPS stands for Floating-point Operations Per Second, i.e., the number of floating-point operations executed per second; it is commonly used to describe peak speed. One GFLOPS equals 10^9 floating-point operations per second.
The number of floating-point operations in matrix multiplication is 2 * M * N * K. The ratio of this operation count to the elapsed time gives the GFLOPS of the current GEMM version. Multiplying by 2 is because each step involves one multiplication and one addition.
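As a minimal sketch of this measurement (the elapsed time is assumed to be measured around the SGEMM call; function names here are illustrative), the conversion from operation count to GFLOPS looks like this:

```c
#include <stdio.h>

// Derive GFLOPS from the 2*M*N*K operation count and the measured time.
static double gflops(int m, int n, int k, double elapsed_seconds) {
    double flop = 2.0 * m * n * k;        // one multiply + one add per step
    return flop / elapsed_seconds / 1e9;  // convert to GFLOPS
}

int main(void) {
    // Example: a 512^3 SGEMM that took 0.25 s reaches about 1.07 GFLOPS.
    printf("%.3f GFLOPS\n", gflops(512, 512, 512, 0.25));
    return 0;
}
```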
Prepare
The related code is located in ./prepare/.
Test Cross-Compilation
I use the Allwinner Nezha D1 development board and downloaded the cross-compilation toolchain from here.
For detailed instructions, please refer to the readme.
Memory Bandwidth Test
I conducted memory bandwidth tests on the development board using the following projects:
Roofline Model
The Roofline model proposes a method for quantitative analysis using "Operational Intensity" and provides a formula for the theoretical performance upper bound achievable on a given computing platform.
According to OpenPPL Public Course | RISC-V Technical Analysis:
- The computing power of D1 can reach 4 GFLOPS (@1 GHz).
- Memory bandwidth: 2.727 GB/s (DDR3 792 MHz).
  - Although the highest I measured was 2.592 GB/s, there may be some problem somewhere. Let's trust SenseTime for now and temporarily accept its value.
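For reference, the standard Roofline bound can be written as below, where \(I\) is the operational intensity in FLOPs/byte and \(B_{\text{mem}}\) the memory bandwidth. With the numbers above (4 GFLOPS peak, 2.727 GB/s), the ridge point sits at roughly \(4 / 2.727 \approx 1.47\) FLOPs/byte:

\[
P_{\text{attainable}} = \min\left(P_{\text{peak}},\; I \times B_{\text{mem}}\right)
\]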
SGEMM Optimization
The related code is located in ./sgemm/.
Usage
Taking step0 as an example, you first need to edit the Makefile to configure your cross-compilation toolchain.
```bash
cd sgemm/step0/
```
Version 0: Naive Version
This version seems to be the most intuitive to me; after all, this is how I learned, understood, and computed matrix multiplication:
> Multiply one row of A by one column of B to get one element of C.
```c
for ( i = 0; i < m; i ++ ) { // Start 2-th loop
```
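For concreteness, here is a minimal sketch of the naive kernel under the row-major assumption; the lda/ldb/ldc leading-dimension parameters are hypothetical and not necessarily the repository's exact signature:

```c
// Naive SGEMM: C[m x n] += A[m x k] * B[k x n], all matrices row-major.
// Each element of C is one row of A dotted with one column of B.
void sgemm_naive(int m, int n, int k,
                 const float *A, int lda,
                 const float *B, int ldb,
                 float *C, int ldc) {
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int p = 0; p < k; p++) {
                // B is read column-wise here (stride ldb): poor spatial locality.
                acc += A[i * lda + p] * B[p * ldb + j];
            }
            C[i * ldc + j] += acc;
        }
    }
}
```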
I think version 0 explains the formula \(C_{mn} = \sum_{k=1}^{K} A_{mk}B_{kn}\) very well.
However, this version has an obvious shortcoming: on a platform with a theoretical computing power of 4 GFLOPS, it only achieves a maximum computational performance of 0.03 GFLOPS. This is because the accesses to matrix B have a very low cache hit rate, i.e., poor spatial locality; over the whole calculation, it is equivalent to sweeping through matrix B many, many times.
It is advisable to access the elements of multi-dimensional arrays in sequential order. This can improve the spatial locality of memory access and make it more friendly to the cache.
Furthermore, it can be observed that as the size increases, the performance fluctuates significantly. Analysis of the data shows that when m = n = k is 128, 160, 192, 224, 256, 288, 320, 352, or 384, the performance is poor. These sizes differ by 32, and 32 * 4 (sizeof(float)) = 128 B.
It is speculated that the performance fluctuation is related to the cache line size and hardware prefetching: the cache line is 64 B, and after a cache miss the hardware prefetcher (HWPrefetcher) reads one extra cache line.
Version 1: Loop Interchange Version
Reusing data that is already in the cache is the most basic and efficient use of the cache. For nested loops, transformations such as loop interchange, loop reversal, loop fusion, loop distribution, loop tiling, loop unrolling, and unroll-and-jam can be used to improve program performance.
Selecting an appropriate loop transformation method can both maintain the semantics of the program and improve its performance.
```c
for ( i = 0; i < m; i ++ ) { // Start 2-th loop
```
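A sketch of the interchanged ordering (MKN, i.e., i-k-j), reusing the same hypothetical signature as the version 0 sketch: the innermost loop now walks a row of B contiguously.

```c
// Loop-interchanged SGEMM ("MKN" order): the innermost loop streams through a
// row of B with stride 1, which greatly improves spatial locality.
void sgemm_mkn(int m, int n, int k,
               const float *A, int lda,
               const float *B, int ldb,
               float *C, int ldc) {
    for (int i = 0; i < m; i++) {
        for (int p = 0; p < k; p++) {
            float a_ip = A[i * lda + p];   // reused across the whole row of B
            for (int j = 0; j < n; j++) {
                C[i * ldc + j] += a_ip * B[p * ldb + j];
            }
        }
    }
}
```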
Compared with version 0, version 1 has better spatial locality for the accesses to matrix B, and the performance is greatly improved (especially for larger sizes; for m = n = k <= 68, version 0 is actually more efficient).
Adjusting the order of the m, n, and k loops does not affect the result (i.e., the semantics of the program are maintained), but it can affect the performance. Below is the performance of different loop orders (on the Allwinner Nezha D1, with m = n = k = 512 as an example):
| Loop Order | GFLOPS | Analysis |
|---|---|---|
| MNK | 0.012 | High cache miss rate when accessing B |
| MKN | 0.180 | |
| NMK | 0.012 | High cache miss rate when accessing B |
| NKM | 0.009 | High cache miss rate when accessing A |
| KMN | 0.165 | |
| KNM | 0.009 | High cache miss rate when accessing A |
However, the hardware utilization of version 1 is still very low, and further optimization is needed.
Version 2: Blocking Version
```c
for ( i = 0; i < m; i += DGEMM_MR ) { // Start 2-nd loop
```
To avoid unnecessarily swapping data in and out of the cache, the computation is processed in blocks. Discussing Why Blocking Matrix Optimization Works is a good read; I recommend learning from it.
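A rough sketch of this blocking structure, assuming 4x4 register tiles; the tile sizes, names, and inner kernel here are illustrative, not necessarily the repository's exact code:

```c
#define DGEMM_MR 4
#define DGEMM_NR 4

// Illustrative 4x4 tile kernel: computes a DGEMM_MR x DGEMM_NR block of C.
static void add_dot_4x4(int k, const float *A, int lda,
                        const float *B, int ldb, float *C, int ldc) {
    for (int p = 0; p < k; p++)
        for (int ir = 0; ir < DGEMM_MR; ir++)
            for (int jr = 0; jr < DGEMM_NR; jr++)
                C[ir * ldc + jr] += A[ir * lda + p] * B[p * ldb + jr];
}

// Blocked SGEMM: walk C in DGEMM_MR x DGEMM_NR tiles
// (m and n assumed to be multiples of the tile sizes for brevity).
void sgemm_blocked(int m, int n, int k,
                   const float *A, int lda,
                   const float *B, int ldb,
                   float *C, int ldc) {
    for (int i = 0; i < m; i += DGEMM_MR)
        for (int j = 0; j < n; j += DGEMM_NR)
            add_dot_4x4(k, &A[i * lda], lda, &B[j], ldb, &C[i * ldc + j], ldc);
}
```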
After introducing blocking in version 2, the performance is still not satisfactory. This is because, although this version superficially implements the blocking logic, some small tricks for the computation within each block have not yet been applied.
Version 3: Blocked Optimization Version
AddDot_4x4_opt has been added.
Several tricks are mentioned in the BLISlab-tutorial:
- 2.4.2 Loop unrolling
  - Updating loop index i and the pointer cp every time through the inner loop creates considerable overhead. For this reason, a compiler will perform loop unrolling.
- 2.4.3 Register variables
  - Notice that computation can only happen if data is stored in registers. A compiler will automatically transform code so that the intermediate steps that place certain data in registers is inserted.
After using these tricks, this version has significantly improved performance!
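A sketch of what these two tricks look like in C; this is only the idea behind AddDot_4x4_opt, not the repository's exact code: the 4x4 tile of C is held in scalar accumulators so the compiler can keep them in registers, and each element of A and B is loaded once per k iteration.

```c
// Register-variable / unrolled 4x4 kernel (illustrative sketch).
static void add_dot_4x4_opt(int k, const float *A, int lda,
                            const float *B, int ldb, float *C, int ldc) {
    float c00 = 0, c01 = 0, c02 = 0, c03 = 0,
          c10 = 0, c11 = 0, c12 = 0, c13 = 0,
          c20 = 0, c21 = 0, c22 = 0, c23 = 0,
          c30 = 0, c31 = 0, c32 = 0, c33 = 0;

    for (int p = 0; p < k; p++) {
        // Load the current column of the A tile and row of the B tile once.
        float a0 = A[0 * lda + p], a1 = A[1 * lda + p];
        float a2 = A[2 * lda + p], a3 = A[3 * lda + p];
        float b0 = B[p * ldb + 0], b1 = B[p * ldb + 1];
        float b2 = B[p * ldb + 2], b3 = B[p * ldb + 3];

        c00 += a0 * b0; c01 += a0 * b1; c02 += a0 * b2; c03 += a0 * b3;
        c10 += a1 * b0; c11 += a1 * b1; c12 += a1 * b2; c13 += a1 * b3;
        c20 += a2 * b0; c21 += a2 * b1; c22 += a2 * b2; c23 += a2 * b3;
        c30 += a3 * b0; c31 += a3 * b1; c32 += a3 * b2; c33 += a3 * b3;
    }

    // Write the accumulated tile back to C.
    C[0 * ldc + 0] += c00; C[0 * ldc + 1] += c01; C[0 * ldc + 2] += c02; C[0 * ldc + 3] += c03;
    C[1 * ldc + 0] += c10; C[1 * ldc + 1] += c11; C[1 * ldc + 2] += c12; C[1 * ldc + 3] += c13;
    C[2 * ldc + 0] += c20; C[2 * ldc + 1] += c21; C[2 * ldc + 2] += c22; C[2 * ldc + 3] += c23;
    C[3 * ldc + 0] += c30; C[3 * ldc + 1] += c31; C[3 * ldc + 2] += c32; C[3 * ldc + 3] += c33;
}
```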
However, for larger matrix sizes, the performance of this version is still relatively low. Upon investigation: after accessing B[0,0], B[0,1], B[0,2], B[0,3], for example, the subsequent access to B[1,0] is bound to be a cache miss when the size is large. Therefore, it would be great if the data could be rearranged in advance.
Version 4: B prepack Version
I assume matrix B holds the parameters (weights), so the pack operation can be performed in advance. Version 4 prepacks matrix B, leading to a further performance improvement!
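A minimal sketch of what prepacking B could look like; the actual packing layout in the repository may differ:

```c
#define DGEMM_NR 4

// Rearrange B (k x n, row-major) into panels of DGEMM_NR columns so that the
// micro-kernel can read packB sequentially. n is assumed to be a multiple of
// DGEMM_NR for brevity; the repository's actual layout may differ.
void pack_B(int k, int n, const float *B, int ldb, float *packB) {
    for (int j = 0; j < n; j += DGEMM_NR)           // one panel per NR columns
        for (int p = 0; p < k; p++)                 // walk down the panel
            for (int jr = 0; jr < DGEMM_NR; jr++)   // NR contiguous elements
                *packB++ = B[p * ldb + j + jr];
}
```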
The reason for the performance improvement is evident: there is a significant reduction in cache misses when accessing matrix B. This is the first time I deeply realized the importance of prepacking neural network weights before model inference.
It can be seen that when the size is relatively large, the performance still declines. This is likely due to a high number of cache misses when accessing matrix A. Should we pack A as well?
I assume matrix A is the input, so packing A cannot be done in advance and its cost must be included in the overall timing. Is it worth it?
Version 5: A pack & B prepack Version
Based on Version 4, Version 5 performs packing on matrix A.
Here, since matrix A is assumed to be an input, packing A needs to be performed during computation, and this time consumption needs to be included in the timing.
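In sketch form, the measured region now includes the packing of A. The helpers pack_A, pack_B, and sgemm_packed below are hypothetical stand-ins for the repository's routines; only the timing structure is the point here.

```c
#include <time.h>

// Hypothetical routines assumed to exist elsewhere in the project.
void pack_A(int m, int k, const float *A, int lda, float *packA);
void pack_B(int k, int n, const float *B, int ldb, float *packB);
void sgemm_packed(int m, int n, int k, const float *packA,
                  const float *packB, float *C, int ldc);

// Version-5-style measurement: B is prepacked outside the timer,
// A is packed inside the timed region because it is the runtime input.
double timed_sgemm_v5(int m, int n, int k,
                      const float *A, int lda, const float *B, int ldb,
                      float *C, int ldc, float *packA, float *packB) {
    pack_B(k, n, B, ldb, packB);                  // weights: not timed

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pack_A(m, k, A, lda, packA);                  // input: packed on the fly, timed
    sgemm_packed(m, n, k, packA, packB, C, ldc);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    return 2.0 * m * n * k / sec / 1e9;           // GFLOPS
}
```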
The results are still pleasing, especially with large matrix sizes, achieving further performance improvements.
I initially approached this experiment with a trial-and-error mindset, given the extra reads of A and writes of packA involved. It seems the main challenge ahead lies in fighting cache misses.
The current optimization direction has reached its limit. It's worth trying to do some preload during the calculation process.
Next, we'll move to assembly, work on vector calculations, and implement preload in assembly.
Version 6: Assembly Version
Brief explanation: A is not packed, while B is prepacked in groups of 16 elements.
```c
for ( i = 0; i < m; i += DGEMM_MR ) { // Start 2-nd loop
```
Regarding the use of RVV instructions, I believe vsetvli is essential, and vfmacc.vf is the mainstay.
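For intuition, the semantics of vfmacc.vf vd, fs1, vs2 can be modeled in scalar C as "multiply each active element of a vector register by one scalar from A and accumulate"; this is only a model of the instruction, not the actual kernel:

```c
// Scalar model of `vfmacc.vf vd, fs1, vs2`: vd[i] += fs1 * vs2[i] for i < vl,
// where vl is whatever the preceding vsetvli configured.
static void vfmacc_vf_model(float *vd, float fs1, const float *vs2, int vl) {
    for (int i = 0; i < vl; i++)
        vd[i] += fs1 * vs2[i];
}
```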
I have learned a lot from OpenPPL Course | RISC-V Technical Analysis. They are truly professional! I recommend learning the theoretical guidance and key points from them; my respects to OpenPPL!
As for the assembly operators, there are many details in assembly, and I must complain: writing assembly is really annoying! Especially the debugging process; it's torturous. The last time I wrote assembly was during my undergraduate classes. Picking it up again brings some novelty and excitement, and being able to control the execution of an operator at a very fine granularity gives a great sense of accomplishment.
Regarding how the assembly files are implemented specifically, I believe the fastest way is to look at the assembly code. I won't explain it further here.
It should be noted that this version's performance is very poor. Why is that? It's another issue of loop order.
Version 7: Assembly Version with Loop Order Swapped
Brief explanation: A is not packed, while B is prepacked in groups of 16 elements.
```c
for ( j = 0; j < n; j += DGEMM_NR ) { // Start 2-st loop
```
Reversing the order of loops, starting with the n-direction followed by the m-direction, significantly improves performance.
However, the performance for large matrices is still not very good. The root cause remains memory access. In the roofline model, the computation of large matrices is considered compute-bound, where ideally the compute time and the memory access time should overlap as much as possible. Currently, a significant amount of time is spent on memory access (mostly due to cache misses!).
Version 8: Assembly Version with Preload
Brief explanation: matrix A is not packed, while matrix B is prepacked in groups of 16 elements, and a preload operation is added.
The performance is explosive! It reaches a maximum of 2.212 GFLOPS.
Core operations:
```asm
vfmacc.vf v16, ft0, v0
```
Inserting some load operations between the vfmacc.vf instructions preloads data that will be used later into the cache, significantly reducing cache misses.
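The assembly achieves this with ordinary load instructions interleaved into the vfmacc.vf stream. As a loose C analogue of the same idea (a different mechanism, shown only for intuition), GCC's __builtin_prefetch can be used to touch data ahead of time:

```c
// Loose C analogue of the preload idea: touch data a few iterations ahead
// while computing the current one, so later accesses hit in the cache.
// (The assembly version uses ordinary loads rather than prefetch hints.)
void axpy_with_prefetch(int n, float alpha, const float *x, float *y) {
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&x[i + 16], 0, 1);   // read prefetch
        __builtin_prefetch(&y[i + 16], 1, 1);   // write prefetch
        y[i] += alpha * x[i];
    }
}
```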
Initially, I was puzzled: how can the compute time and the memory access time overlap when the code seems to execute sequentially? It wasn't until later that I understood that the essence here lies in how cache lines work. Indeed, foundational knowledge is crucial!
Version 9: Assembly Version with A Packed
Based on previous experience, an attempt was made to pack matrix A, but surprisingly, the results were not very good. A brief analysis suggests that the preload for matrix A in this version of the assembly code might not be well optimized.
In the previous version, although A wasn't packed, there was a preload for A's 4 rows, which already addressed the pain point of cache misses for matrix A.
Conclusion
To continue optimizing this operator, there is still much to be done, such as rearranging instructions in the assembly to make better use of the pipeline.
As mentioned in the OpenPPL Course | RISC-V Technical Analysis, using vf instructions can achieve 80% of the theoretical peak, i.e., 4 * 80% = 3.2 GFLOPS. I currently reach only 2.121 GFLOPS, indicating there is theoretically still a lot of room for optimization.
Furthermore, the RVV version used here is 0.7.1, and it seems there is still much work to be done on RVV instruction optimization, such as the serious efficiency problem encountered with vlw.
In conclusion, working on these tasks has allowed me to learn a lot from many experts, and I'm very grateful. I also hope this article can help more people.
Acknowledgement
- BLISlab: A Sandbox for Optimizing GEMM
  - This project introduced me to how to optimize GEMM; I conducted my experiments and exploration based on it.
Thanks to Mr. Ding for his guidance.
References
Anatomy of High-Performance Matrix Multiplication