Performance optimization of BLAS algorithms with band matrices for RISC-V processors
Anna Pirova, Anastasia Vodeneeva, Konstantin Kovalev, Alexander Ustinov, Evgeny Kozinov, Alexey Liniov, Valentin Volokitin, Iosif Meyerov

TL;DR
This paper optimizes BLAS algorithms for band matrices on RISC-V processors, demonstrating significant speedups from improved vectorization and from RISC-V-specific features such as vector register grouping, with experiments on embedded devices.
Contribution
The paper presents the first detailed analysis and optimization of BLAS band matrix algorithms on RISC-V, highlighting effective vectorization techniques and the resulting performance gains.
Findings
Speedups of 1.5x to 10x over baseline implementations.
Effective use of RISC-V vector register grouping for performance.
Successful optimization on embedded RISC-V devices.
Abstract
The rapid development of RISC-V instruction set architecture presents new opportunities and challenges for software developers. Is it sufficient to simply recompile high-performance software optimized for x86-64 onto RISC-V CPUs? Are current compilers capable of effectively optimizing C and C++ codes or is it necessary to use intrinsics or assembler? Can we analyze and improve performance without well-developed profiling tools? Do standard optimization techniques work? Are there specific RISC-V features that need to be considered? These and other questions require careful consideration. In this paper, we present our experience optimizing four BLAS algorithms for band matrix operations on RISC-V processors. We demonstrate how RISC-V-optimized implementations of OpenBLAS algorithms can be significantly accelerated through improved vectorization of computationally intensive loops.…
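As an illustration of the kind of computationally intensive loop the abstract refers to, the following is a minimal C sketch of a general band matrix-vector product, y += A*x, using the standard BLAS band storage layout (as used by GBMV). The function name `band_matvec` and its signature are hypothetical, and this excerpt does not say which four routines the paper optimizes. The sketch is a plain scalar reference under those assumptions, not the authors' implementation.

```c
#include <stddef.h>

/*
 * Hypothetical reference kernel: y += A * x for an m-by-n band matrix A
 * with kl sub-diagonals and ku super-diagonals.
 *
 * A is held in BLAS band storage (column-major, lda >= kl + ku + 1):
 * element A(i, j) lives at ab[j * lda + (ku + i - j)] for
 * max(0, j - ku) <= i <= min(m - 1, j + kl).
 */
void band_matvec(int m, int n, int kl, int ku,
                 const double *ab, int lda,
                 const double *x, double *y)
{
    for (int j = 0; j < n; ++j) {
        /* Rows of column j that fall inside the band. */
        int i_lo = (j - ku > 0) ? j - ku : 0;
        int i_hi = (j + kl < m - 1) ? j + kl : m - 1;
        double xj = x[j];

        /* Unit-stride in both ab and y: the loop a vectorizer would target. */
        for (int i = i_lo; i <= i_hi; ++i) {
            y[i] += ab[(size_t)j * lda + (size_t)(ku + i - j)] * xj;
        }
    }
}
```

The inner loop runs over at most kl + ku + 1 elements per column, so for narrow bands its trip count is small. That is one plausible reason the choice of vector length and register grouping matters for band kernels, though the excerpt itself does not give details.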
