Distributed Inference with Minimal Off-Chip Traffic for Transformers on Low-Power MCUs
Severin Bochem, Victor J.B. Jung, Arpan Prasad, Francesco Conti, Luca Benini

TL;DR
This paper presents a distributed inference methodology for Transformer models on low-power MCUs, significantly reducing off-chip traffic and enabling efficient on-device AI for wearable devices.
Contribution
It introduces a novel distributed system approach that partitions Transformer inference across multiple MCUs, minimizing off-chip traffic and enabling deployment on resource-constrained wearable devices.
Findings
Achieved 26.1x speedup with TinyLlama-42M on 8 MCUs
Reduced energy consumption to 0.64 mJ per inference
Improved Energy Delay Product by 27.2x
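The speedup and EDP figures above can be related with one line of arithmetic. The sketch below assumes the usual definition EDP = energy x latency, and assumes both ratios are measured against the same baseline; neither assumption is stated in the findings themselves. The function name is illustrative and not taken from the paper.

```python
def implied_energy_ratio(speedup: float, edp_improvement: float) -> float:
    """Baseline-to-distributed energy ratio implied by the two reported figures.

    Assumption: EDP = energy * latency, so
        edp_improvement = (E_base * T_base) / (E_dist * T_dist)
                        = energy_ratio * speedup
    and therefore energy_ratio = edp_improvement / speedup.
    """
    return edp_improvement / speedup


# Figures reported in the findings: 26.1x speedup, 27.2x EDP improvement.
ratio = implied_energy_ratio(speedup=26.1, edp_improvement=27.2)
print(f"implied energy ratio (baseline / distributed): {ratio:.3f}")
```

If those assumptions hold, almost all of the EDP gain comes from lower latency, with energy per inference staying close to the baseline.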
Abstract
Contextual Artificial Intelligence (AI) based on emerging Transformer models is predicted to drive the next technology revolution in interactive wearable devices such as new-generation smart glasses. By coupling numerous sensors with small, low-power Micro-Controller Units (MCUs), these devices will enable on-device intelligence and sensor control. A major bottleneck in this class of systems is the small amount of on-chip memory available in the MCUs. In this paper, we propose a methodology to deploy real-world Transformers on low-power wearable devices with minimal off-chip traffic exploiting a distributed system of MCUs, partitioning inference across multiple devices and enabling execution with stationary on-chip weights. We validate the scheme by deploying the TinyLlama-42M decoder-only model on a system of 8 parallel ultra-low-power MCUs. The distributed system achieves an energy…
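To make the idea of partitioning inference with stationary on-chip weights more concrete, here is a minimal sketch. The abstract does not describe the actual partitioning scheme, so the row-wise split of a single linear layer below is an assumption, and every name in it (`split_rows`, `matvec`, `distributed_matvec`) is hypothetical. Each shard stands in for the weights one MCU would keep resident on-chip; only the input vector and the output slices would need to move between devices.

```python
from typing import List

Matrix = List[List[float]]


def split_rows(weights: Matrix, num_devices: int) -> List[Matrix]:
    """Split the output rows of a weight matrix into contiguous shards,
    one per device. Assumed scheme, not taken from the paper."""
    base, extra = divmod(len(weights), num_devices)
    shards, start = [], 0
    for device in range(num_devices):
        size = base + (1 if device < extra else 0)
        shards.append(weights[start:start + size])
        start += size
    return shards


def matvec(weights: Matrix, x: List[float]) -> List[float]:
    """Plain matrix-vector product: one dot product per output row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def distributed_matvec(shards: List[Matrix], x: List[float]) -> List[float]:
    """Each loop iteration stands in for one MCU working on its own shard.
    The shard never leaves the device; only x and the output slice move."""
    out: List[float] = []
    for shard in shards:
        out.extend(matvec(shard, x))
    return out


# Toy layer: 16 outputs, 4 inputs, split across 8 simulated devices.
weights = [[float(r + c) for c in range(4)] for r in range(16)]
x = [1.0, 2.0, 3.0, 4.0]

shards = split_rows(weights, num_devices=8)
result = distributed_matvec(shards, x)

# The partitioned computation matches the single-device result.
assert result == matvec(weights, x)
print(f"{len(shards)} shards, {len(shards[0])} rows each")
```

In this toy setup each simulated device holds one eighth of the layer's weights, which is the property that would let a model too large for one MCU's on-chip memory fit across several.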
Taxonomy
Topics: Interconnection Networks and Systems · Low-power high-performance VLSI design · Advanced Memory and Neural Computing
