Deeploy: Enabling Energy-Efficient Deployment of Small Language Models On Heterogeneous Microcontrollers
Moritz Scherer, Luka Macan, Victor Jung, Philip Wiese, Luca Bompani, Alessio Burrello, Francesco Conti, Luca Benini

TL;DR
Deeploy is a DNN compiler that generates optimized C code for deploying small language models end-to-end on heterogeneous (multicore + NPU) microcontrollers, targeting low energy per token and high throughput without relying on external memory.
Contribution
The paper introduces Deeploy, a DNN compiler that automates the deployment of SLMs on heterogeneous microcontrollers, optimizing performance and energy efficiency.
Findings
Energy consumption of 490 µJ per token.
Throughput of 340 tokens per second.
First deployment of SLMs on an MCU without external memory.
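The two headline figures can be combined into a rough average-power estimate. The sketch below assumes that the energy and throughput numbers were measured at the same operating point, which this summary does not state, so the result is an illustrative derivation rather than a figure reported by the authors. The function names are hypothetical helpers introduced for this example only.

```python
# Figures quoted in the Findings list above.
ENERGY_PER_TOKEN_J = 490e-6    # 490 microjoules per token
THROUGHPUT_TOK_PER_S = 340.0   # 340 tokens per second


def average_power_w(energy_per_token_j: float, tokens_per_s: float) -> float:
    """Average power in watts: (joules / token) * (tokens / second)."""
    return energy_per_token_j * tokens_per_s


def latency_per_token_s(tokens_per_s: float) -> float:
    """Mean time per generated token in seconds."""
    return 1.0 / tokens_per_s


def energy_for_tokens_j(energy_per_token_j: float, n_tokens: int) -> float:
    """Total energy in joules for generating n_tokens tokens."""
    return energy_per_token_j * n_tokens


if __name__ == "__main__":
    # ASSUMPTION: both figures describe the same operating point.
    power_mw = average_power_w(ENERGY_PER_TOKEN_J, THROUGHPUT_TOK_PER_S) * 1e3
    latency_ms = latency_per_token_s(THROUGHPUT_TOK_PER_S) * 1e3
    print(f"implied average power: {power_mw:.1f} mW")
    print(f"mean latency per token: {latency_ms:.2f} ms")
```

Under that assumption the implied average power is about 167 mW, with a mean latency of roughly 2.9 ms per token.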
Abstract
With the rise of Embodied Foundation Models (EFMs), most notably Small Language Models (SLMs), adapting Transformers for edge applications has become a very active field of research. However, achieving end-to-end deployment of SLMs on microcontroller (MCU)-class chips without high-bandwidth off-chip main memory access is still an open challenge. In this paper, we demonstrate high-efficiency end-to-end SLM deployment on a multicore RISC-V (RV32) MCU augmented with ML instruction extensions and a hardware neural processing unit (NPU). To automate the exploration of the constrained, multi-dimensional memory vs. computation tradeoffs involved in aggressive SLM deployment on heterogeneous (multicore+NPU) resources, we introduce Deeploy, a novel Deep Neural Network (DNN) compiler, which generates highly-optimized C code requiring minimal runtime support. We demonstrate that Deeploy generates…
Taxonomy
Topics: Network Packet Processing and Optimization · Topic Modeling · Parallel Computing and Optimization Techniques
