Deeploy: Enabling Energy-Efficient Deployment of Small Language Models On Heterogeneous Microcontrollers

Moritz Scherer; Luka Macan; Victor Jung; Philip Wiese; Luca Bompani; Alessio Burrello; Francesco Conti; Luca Benini

arXiv:2408.04413·cs.LG·August 9, 2024

Open Access

TL;DR

Deeploy is a DNN compiler that generates optimized C code for deploying small language models on heterogeneous (multicore + NPU) microcontrollers without external memory, reaching 490 µJ per token and 340 tokens per second.

Contribution

The paper introduces Deeploy, a DNN compiler that automates the deployment of SLMs on heterogeneous microcontrollers by exploring the tradeoffs between memory use and computation and generating optimized C code that needs minimal runtime support.

Findings

01. Achieved an energy consumption of 490 µJ per token.

02. Reached a throughput of 340 tokens per second.

03. First deployment of SLMs on an MCU without external memory.
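
As a rough sanity check on what the two headline figures imply, the following Python sketch derives average power, per-token latency, and the energy for a short generation. It assumes that both figures describe the same operating point, which this summary does not state explicitly; the function names are illustrative and do not come from the paper or from Deeploy.

```python
# Back-of-the-envelope arithmetic on the two headline figures.
# Assumption: both numbers refer to the same operating point.
# All function names here are hypothetical, chosen for illustration.

ENERGY_PER_TOKEN_J = 490e-6    # 490 microjoules per token
THROUGHPUT_TOK_PER_S = 340.0   # 340 tokens per second


def average_power_w(energy_per_token_j: float, tokens_per_s: float) -> float:
    """Average power in watts: (J / token) * (token / s) = J / s."""
    return energy_per_token_j * tokens_per_s


def latency_per_token_ms(tokens_per_s: float) -> float:
    """Mean time per generated token in milliseconds."""
    return 1000.0 / tokens_per_s


def energy_for_tokens_mj(energy_per_token_j: float, n_tokens: int) -> float:
    """Energy in millijoules needed to generate n_tokens tokens."""
    return energy_per_token_j * n_tokens * 1000.0


if __name__ == "__main__":
    power = average_power_w(ENERGY_PER_TOKEN_J, THROUGHPUT_TOK_PER_S)
    print(f"average power: {power * 1000:.1f} mW")             # 166.6 mW
    print(f"latency: {latency_per_token_ms(THROUGHPUT_TOK_PER_S):.2f} ms/token")  # 2.94
    print(f"100 tokens: {energy_for_tokens_mj(ENERGY_PER_TOKEN_J, 100):.1f} mJ")  # 49.0
```

Under that assumption, the figures correspond to an average power of about 167 mW and a little under 3 ms per token.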

Abstract

With the rise of Embodied Foundation Models (EFMs), most notably Small Language Models (SLMs), adapting Transformers for edge applications has become a very active field of research. However, achieving end-to-end deployment of SLMs on microcontroller (MCU)-class chips without high-bandwidth off-chip main memory access is still an open challenge. In this paper, we demonstrate high-efficiency end-to-end SLM deployment on a multicore RISC-V (RV32) MCU augmented with ML instruction extensions and a hardware neural processing unit (NPU). To automate the exploration of the constrained, multi-dimensional memory vs. computation tradeoffs involved in aggressive SLM deployment on heterogeneous (multicore+NPU) resources, we introduce Deeploy, a novel Deep Neural Network (DNN) compiler, which generates highly-optimized C code requiring minimal runtime support. We demonstrate that Deeploy generates…

Taxonomy

Topics: Network Packet Processing and Optimization · Topic Modeling · Parallel Computing and Optimization Techniques