Towards Zero-Stall Matrix Multiplication on Energy-Efficient RISC-V Clusters for Machine Learning Acceleration

Luca Colagrande; Lorenzo Leone; Maximilian Coco; Andrei Deaconeasa; Luca Benini

arXiv:2506.10921 · cs.AR · June 13, 2025

TL;DR

This paper presents microarchitectural innovations for RISC-V clusters that nearly eliminate control and memory access inefficiencies in matrix multiplication, significantly boosting performance and energy efficiency for ML workloads.

Contribution

It introduces zero-overhead loop nests and a conflict-free memory system to optimize RISC-V clusters for ML acceleration, maintaining programmability and broad workload support.
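
To illustrate why loop handling matters for utilization, here is a minimal sketch of a toy cycle model. It is an illustrative assumption, not the paper's methodology: each outer-loop iteration issues a fixed number of useful compute instructions, one per cycle, and then loses a few cycles to loop bookkeeping. The function name `fpu_utilization` and the trip counts and overhead values are hypothetical.

```python
def fpu_utilization(inner_trip_count: int, overhead_cycles: int) -> float:
    """Fraction of cycles spent on useful compute instructions.

    Toy model (assumption): every outer-loop iteration issues
    `inner_trip_count` useful instructions at one per cycle, then spends
    `overhead_cycles` on loop bookkeeping such as counter updates,
    branches, and pointer setup.
    """
    if inner_trip_count <= 0:
        raise ValueError("inner_trip_count must be positive")
    if overhead_cycles < 0:
        raise ValueError("overhead_cycles must be non-negative")
    useful = inner_trip_count
    total = useful + overhead_cycles
    return useful / total


if __name__ == "__main__":
    # Hypothetical values chosen only to show the trend.
    for overhead in (0, 2, 4):
        util = fpu_utilization(inner_trip_count=32, overhead_cycles=overhead)
        print(f"overhead={overhead} cycles -> utilization={util:.3f}")
```

In this model, utilization reaches 1.0 only when the per-iteration overhead is zero, which is the intuition behind a zero-overhead loop nest. Short inner loops are hurt most, because the same fixed overhead is amortized over fewer useful cycles.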

Findings

01 Achieved 96.1% to 99.4% utilization in matrix multiplication workloads.

02 Realized 11% performance and 8% energy efficiency improvements over baseline.

03 Maintained comparable efficiency to specialized accelerators with full programmability.

Abstract

The growing computational demands of machine learning (ML) workloads have driven the design of ML accelerators aiming at an optimal tradeoff between efficiency and flexibility. A widely explored architecture for flexible ML accelerators is based on clusters of lightweight instruction processors sharing multi-banked L1 memory, augmented with specialized instruction extensions for key ML-related computations, such as matrix multiplication (matmul). However, instruction extensions should be coupled with microarchitectural optimizations that remove inefficiencies due to control flow (loop handling) and memory access, without drastically increasing processor complexity. Moving from a state-of-the-art (SoA) ML accelerator cluster based on RISC-V processors, we propose a low-overhead optimized microarchitecture that eliminates these inefficiencies almost entirely while retaining…
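
As a rough illustration of how accesses to a multi-banked memory can stall, the sketch below counts the cycles needed to serve one batch of simultaneous requests. The word-interleaved address-to-bank mapping, the one-request-per-bank-per-cycle rule, and the parameters `num_banks` and `word_bytes` are assumptions for illustration; the source does not describe the cluster's actual memory organization at this level.

```python
from collections import Counter
from typing import Sequence


def cycles_to_serve(addresses: Sequence[int], num_banks: int, word_bytes: int = 8) -> int:
    """Cycles needed to serve a batch of simultaneous memory requests.

    Toy model (assumption): banks are word-interleaved, so the bank index
    is (address // word_bytes) % num_banks, and each bank serves one
    request per cycle. The batch therefore takes as many cycles as the
    most heavily loaded bank has requests.
    """
    if num_banks <= 0 or word_bytes <= 0:
        raise ValueError("num_banks and word_bytes must be positive")
    if not addresses:
        return 0
    load = Counter((addr // word_bytes) % num_banks for addr in addresses)
    return max(load.values())


if __name__ == "__main__":
    banks = 32  # hypothetical bank count
    # Eight requesters reading consecutive words: each hits a different bank.
    unit_stride = [i * 8 for i in range(8)]
    # Eight requesters reading with a stride of 32 words: all hit the same bank.
    bank_stride = [i * 32 * 8 for i in range(8)]
    print(cycles_to_serve(unit_stride, banks))
    print(cycles_to_serve(bank_stride, banks))
```

Under these assumptions, a batch is conflict-free when every request maps to a distinct bank. Any extra cycle beyond the first is a stall for at least one requester.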

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Low-power high-performance VLSI design · Big Data and Digital Economy