Microarchitectural Co-Optimization for Sustained Throughput of RISC-V Multi-Lane Chaining Vector Processors
Weiying Wang, Zhiwei Zhang

TL;DR
This paper analyzes the microarchitectural inefficiencies that limit sustained throughput in the open-source RISC-V vector processor Ara and removes them through targeted microarchitectural enhancements, reaching a 1.33x speedup over the baseline.
Contribution
It introduces a microarchitectural optimization framework for RISC-V vector processors that improves sustained throughput by addressing bottlenecks along three execution paths: memory-side, control-side, and operand-delivery.
Findings
Achieved a 1.33x speedup over the baseline Ara processor.
Closed 12.2% of the performance gap to the theoretical bound.
Speedups of approximately 2.41x, 1.60x, 1.52x, and 1.42x on key kernels.
Abstract
Modern RISC-V vector processors rely on the synergy of multi-lane parallelism and chaining to achieve high sustained throughput, yet their achieved performance often falls substantially short of the theoretical performance bound due to microarchitectural inefficiencies. In this work, we take the open-source RVV processor Ara as the target platform, analyze the sources of its sustained-throughput loss, and optimize the design accordingly. We first establish an ideal multi-lane chaining execution model as a microarchitectural reference for the ideal steady-state progression of the vector backend. Based on this model, we attribute Ara's key bottlenecks to inefficiencies along three critical execution paths: memory-side inefficiencies in data supply and transaction issuance, control-side inefficiencies caused by conservative dependence management and issue control, and operand-delivery…
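To make the idea of an ideal multi-lane chaining reference concrete, here is a minimal sketch of a textbook-style cycle model. It is not the paper's model, which the truncated abstract does not specify. The function names, the per-unit startup latencies, and the assumption that every instruction in the dependence chain runs on its own functional unit at one element per lane per cycle are all hypothetical and chosen for illustration only.

```python
import math


def beats(vl: int, lanes: int) -> int:
    """Cycles needed to stream vl elements through `lanes` parallel lanes."""
    return math.ceil(vl / lanes)


def unchained_cycles(vl: int, lanes: int, latencies: list[int]) -> int:
    """Each dependent instruction waits for its producer to finish entirely.

    Every instruction pays its own startup latency plus a full pass over
    the vector before the next one may begin.
    """
    return sum(lat + beats(vl, lanes) for lat in latencies)


def chained_cycles(vl: int, lanes: int, latencies: list[int]) -> int:
    """Ideal chaining: a consumer starts as soon as the first results appear.

    Assumes (hypothetically) one functional unit per instruction and no
    stalls, so startup latencies add up once and the element streams overlap.
    """
    return sum(latencies) + beats(vl, lanes)


if __name__ == "__main__":
    vl, lanes = 256, 4          # illustrative vector length and lane count
    latencies = [4, 3, 5]       # illustrative startup latencies, in cycles
    slow = unchained_cycles(vl, lanes, latencies)
    fast = chained_cycles(vl, lanes, latencies)
    print(f"unchained: {slow} cycles, ideally chained: {fast} cycles")
    print(f"upper bound on chaining speedup: {slow / fast:.2f}x")
```

In this simplified picture, the gap between a measured cycle count and `chained_cycles` is the kind of sustained-throughput loss the abstract attributes to memory-side, control-side, and operand-delivery inefficiencies.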
