Optimizing Foundation Model Inference on a Many-tiny-core Open-source RISC-V Platform
Viviane Potocnik, Luca Colagrande, Tim Fischer, Luca Bertaccini, Daniele Jahier Pagliari, Alessio Burrello, Luca Benini

TL;DR
This paper presents the first end-to-end inference of transformer models on an open-source many-tiny-core RISC-V platform, reporting up to 12.8x speedup over a baseline implementation for encoder-only models and 2.04x higher FPU utilization than state-of-the-art accelerators.
Contribution
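The paper optimizes transformer inference on a many-tiny-core RISC-V platform by combining ISA extensions for SIMD floating-point operand streaming and instruction repetition, specialized DMA engines, and distributed Softmax primitives. The authors present these as the first end-to-end transformer inference results on such a platform.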
Findings
Up to 12.8x speedup for encoder-only models
Over 79% FPU utilization and 294 GFLOPS/W efficiency
2.04x higher FPU utilization compared to state-of-the-art accelerators
Abstract
Transformer-based foundation models have become crucial for various domains, most notably natural language processing (NLP) or computer vision (CV). These models are predominantly deployed on high-performance GPUs or hardwired accelerators with highly customized, proprietary instruction sets. Until now, limited attention has been given to RISC-V-based general-purpose platforms. In our work, we present the first end-to-end inference results of transformer models on an open-source many-tiny-core RISC-V platform implementing distributed Softmax primitives and leveraging ISA extensions for SIMD floating-point operand streaming and instruction repetition, as well as specialized DMA engines to minimize costly main memory accesses and to tolerate their latency. We focus on two foundational transformer topologies, encoder-only and decoder-only models. For encoder-only models, we demonstrate a…
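The abstract mentions distributed Softmax primitives without describing them in the text available here. As a rough illustration of the general idea only, and not the authors' implementation, the sketch below splits one row of attention scores across several workers, lets each compute a local maximum and a local sum of exponentials, and then combines those partial results into a numerically stable global Softmax. The function names (local_stats, distributed_softmax) and the num_cores parameter are hypothetical, and the "cores" are simulated sequentially in plain Python.

```python
import math

def local_stats(chunk):
    # Per-core step (hypothetical): local max and sum of exp(x - local max).
    m = max(chunk)
    s = sum(math.exp(x - m) for x in chunk)
    return m, s

def distributed_softmax(row, num_cores):
    # Split the row into contiguous chunks, one per simulated core.
    size = math.ceil(len(row) / num_cores)
    chunks = [row[i:i + size] for i in range(0, len(row), size)]

    # Each simulated core computes its partial statistics independently.
    stats = [local_stats(c) for c in chunks]

    # Reduction step: global max, then rescale each local sum to that max.
    g_max = max(m for m, _ in stats)
    g_sum = sum(s * math.exp(m - g_max) for m, s in stats)

    # Each simulated core normalizes its own chunk with the global values.
    out = []
    for c in chunks:
        out.extend(math.exp(x - g_max) / g_sum for x in c)
    return out

if __name__ == "__main__":
    scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    probs = distributed_softmax(scores, num_cores=3)
    print(probs, sum(probs))
```

Rescaling each local sum by exp(local max - global max) is what allows the chunks to be processed independently before a single small reduction. On real hardware that reduction would involve inter-core communication, which this sketch does not model.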
Taxonomy
Topics: Distributed and Parallel Computing Systems
Methods: Softmax · Dual Multimodal Attention · Focus
