Optimizing Foundation Model Inference on a Many-tiny-core Open-source RISC-V Platform

Viviane Potocnik; Luca Colagrande; Tim Fischer; Luca Bertaccini; Daniele Jahier Pagliari; Alessio Burrello; Luca Benini

arXiv:2405.19284·cs.DC·May 30, 2024

Open Access

TL;DR

This paper demonstrates the first end-to-end inference of transformer models on an open-source RISC-V platform, achieving significant speedups and efficiency improvements over baseline and state-of-the-art accelerators.

Contribution

It introduces optimized transformer inference on a many-tiny-core RISC-V platform using ISA extensions, specialized DMA, and distributed primitives, a novel approach in this domain.

Findings

01. Up to 12.8x speedup for encoder-only models

02. Over 79% FPU utilization and 294 GFLOPS/W efficiency

03. 2.04x higher FPU utilization compared to state-of-the-art accelerators

Abstract

Transformer-based foundation models have become crucial for various domains, most notably natural language processing (NLP) or computer vision (CV). These models are predominantly deployed on high-performance GPUs or hardwired accelerators with highly customized, proprietary instruction sets. Until now, limited attention has been given to RISC-V-based general-purpose platforms. In our work, we present the first end-to-end inference results of transformer models on an open-source many-tiny-core RISC-V platform implementing distributed Softmax primitives and leveraging ISA extensions for SIMD floating-point operand streaming and instruction repetition, as well as specialized DMA engines to minimize costly main memory accesses and to tolerate their latency. We focus on two foundational transformer topologies, encoder-only and decoder-only models. For encoder-only models, we demonstrate a…
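
The abstract mentions distributed Softmax primitives without detailing them in this excerpt. The following is a minimal sketch of one common way to split a Softmax across several workers, not the authors' implementation: each worker handles a contiguous chunk of the row, and two reductions (a global maximum and a global sum) tie the chunks together. The function name `chunked_softmax` and the parameter `num_cores` are hypothetical, and the "cores" are simulated sequentially in plain Python.

```python
import math


def chunked_softmax(row, num_cores):
    """Softmax over `row`, computed chunk by chunk with two global reductions.

    Illustrative only: `num_cores` stands in for the number of workers,
    which are simulated sequentially here.
    """
    if not row:
        return []

    # Split the row into contiguous chunks, one per (hypothetical) core.
    chunk = math.ceil(len(row) / num_cores)
    chunks = [row[i:i + chunk] for i in range(0, len(row), chunk)]

    # Reduction 1: each core finds a local maximum; combine into a global one.
    global_max = max(max(c) for c in chunks)

    # Each core exponentiates its chunk, shifted by the global maximum
    # for numerical stability, and accumulates a local sum.
    exps = [[math.exp(x - global_max) for x in c] for c in chunks]

    # Reduction 2: combine the local sums into the global normalizer.
    global_sum = sum(sum(e) for e in exps)

    # Each core normalizes its own chunk with the shared sum.
    return [v / global_sum for e in exps for v in e]


probs = chunked_softmax([1.0, 2.0, 3.0, 4.0, 5.0], num_cores=2)
```

The result is independent of how many chunks are used, up to floating-point rounding, so the same row can be processed by one worker or many. On real hardware the two reductions are the only points where workers need to exchange data.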

Taxonomy

Topics: Distributed and Parallel Computing Systems

Methods: Softmax · Dual Multimodal Attention · Focus