Bare-Metal RISC-V + NVDLA SoC for Efficient Deep Learning Inference

Vineet Kumar, Ajay Kumar M, Yike Li, Shreejith Shanker, Deepu John (School of Electrical and Electronic Engineering, University College Dublin, Dublin, Ireland; Department of Electronic and Electrical Engineering, Trinity College Dublin, Dublin, Ireland)

arXiv:2508.16095 · cs.AR · November 19, 2025

TL;DR

This paper introduces a specialized SoC combining RISC-V and NVDLA for efficient deep learning inference at the edge, utilizing a bare-metal software flow to optimize speed and storage.

Contribution

It presents a novel tightly coupled RISC-V and NVDLA architecture with a bare-metal toolflow, improving inference speed and efficiency for edge deep learning applications.

Findings

01. Inference times: 4.8 ms for LeNet-5, 16.2 ms for ResNet-18, and 1.1 s for ResNet-50.

02. These results were obtained on an AMD ZCU102 FPGA board with a 100 MHz system clock.

03. The bare-metal approach avoids OS overheads, improving performance.
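To put the reported latencies in hardware terms, they can be converted to clock cycles at the stated 100 MHz system clock. The following is a minimal sketch, assuming the whole measured latency elapses at that single clock rate; the names `CLOCK_HZ`, `LATENCIES_US`, and `latency_to_cycles` are illustrative and do not come from the paper.

```python
# Convert the reported inference latencies into clock cycles.
# Assumption: the full latency is spent at the 100 MHz system clock.

CLOCK_HZ = 100_000_000  # 100 MHz system clock, as reported in the findings

# Reported latencies, stored as integer microseconds to avoid float rounding.
LATENCIES_US = {
    "LeNet-5": 4_800,        # 4.8 ms
    "ResNet-18": 16_200,     # 16.2 ms
    "ResNet-50": 1_100_000,  # 1.1 s
}


def latency_to_cycles(latency_us: int, clock_hz: int = CLOCK_HZ) -> int:
    """Return the number of clock cycles elapsed in latency_us microseconds."""
    return latency_us * clock_hz // 1_000_000


cycles = {model: latency_to_cycles(t) for model, t in LATENCIES_US.items()}

for model, n in cycles.items():
    print(f"{model}: {n:,} cycles")
# LeNet-5: 480,000 cycles
# ResNet-18: 1,620,000 cycles
# ResNet-50: 110,000,000 cycles
```

Under this assumption, ResNet-50 takes roughly 68 times as many cycles as ResNet-18 (110,000,000 versus 1,620,000).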

Abstract

This paper presents a novel System-on-Chip (SoC) architecture for accelerating complex deep learning models for edge computing applications through a combination of hardware and software optimisations. The hardware architecture tightly couples the open-source NVIDIA Deep Learning Accelerator (NVDLA) to a 32-bit, 4-stage pipelined RISC-V core from Codasip called uRISC_V. To offload the model acceleration in software, our toolflow generates bare-metal application code (in assembly), overcoming the complex OS overheads of previous works that have explored similar architectures. This tightly coupled architecture and bare-metal flow lead to improvements in execution speed and storage efficiency, making the design suitable for edge computing solutions. We evaluate the architecture on AMD's ZCU102 FPGA board using the NVDLA-small configuration and test the flow using the LeNet-5, ResNet-18 and ResNet-50 models.…
