Indirection Stream Semantic Register Architecture for Efficient Sparse-Dense Linear Algebra
Paul Scheffler, Florian Zaruba, Fabian Schuiki, Torsten Hoefler, Luca Benini

TL;DR
This paper extends a memory-streaming RISC-V ISA extension with streaming indirection to accelerate sparse-dense linear algebra, reporting speedups of up to 7.2x on a single core and, on a multi-core cluster, up to 5.8x with 2.7x better energy efficiency, all relative to an optimized baseline without the extension.
Contribution
It enhances a memory-streaming RISC-V ISA extension with streaming indirection and presents dot, matrix-vector, and matrix-matrix product kernels that reach high single-core FPU utilization, along with a matrix-vector implementation for a multi-core cluster.
Findings
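Single-core FPU utilization of up to 80% with the hardware extension.
Speedups of up to 7.2x on a single core and up to 5.8x for matrix-vector products on a multi-core cluster, both over an optimized baseline without the extension.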
2.8x higher peak FP64 utilization compared to a GTX 1080 Ti GPU.
Abstract
Sparse-dense linear algebra is crucial in many domains, but challenging to handle efficiently on CPUs, GPUs, and accelerators alike; multiplications with sparse formats like CSR and CSF require indirect memory lookups. In this work, we enhance a memory-streaming RISC-V ISA extension to accelerate sparse-dense products through streaming indirection. We present efficient dot, matrix-vector, and matrix-matrix product kernels using our hardware, enabling single-core FPU utilizations of up to 80% and speedups of up to 7.2x over an optimized baseline without extensions. A matrix-vector implementation on a multi-core cluster is up to 5.8x faster and 2.7x more energy-efficient with our kernels than an optimized baseline. We propose further uses for our indirection hardware, such as scatter-gather operations and codebook decoding, and compare our work to state-of-the-art CPU, GPU, and…
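To illustrate the indirect memory lookups the abstract refers to, the following is a minimal software sketch of a CSR sparse matrix times dense vector product. It is not the paper's kernel and does not model the ISA extension. The function name `csr_matvec` and the array names `row_ptr`, `col_idx`, and `values` are assumptions chosen for illustration.

```python
def csr_matvec(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form (illustrative sketch)."""
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):
        acc = 0.0
        # Nonzeros of this row occupy positions row_ptr[row] .. row_ptr[row + 1] - 1.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # Indirect lookup: which element of x is read depends on col_idx[k],
            # so the access pattern into the dense operand is data-dependent.
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Example matrix (hypothetical data):
# A = [[1, 0, 2],
#      [0, 0, 3],
#      [4, 5, 0]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 2, 0, 1]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]

y = csr_matvec(row_ptr, col_idx, values, x)
print(y)
```

The inner loop performs one index load (`col_idx[k]`) and one dependent load (`x[col_idx[k]]`) for every multiply-accumulate. This dependent access is the kind of work that streaming indirection, as described in the abstract, is meant to take off the core.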
