Distributed Inference with Minimal Off-Chip Traffic for Transformers on Low-Power MCUs

Severin Bochem; Victor J.B. Jung; Arpan Prasad; Francesco Conti; Luca Benini

arXiv:2412.04372·cs.AR·March 27, 2025

TL;DR

This paper presents a distributed inference methodology for Transformer models on low-power MCUs, significantly reducing off-chip traffic and enabling efficient on-device AI for wearable devices.

Contribution

The paper introduces a distributed-system approach that partitions Transformer inference across multiple MCUs, minimizing off-chip traffic and enabling deployment on resource-constrained wearable devices.

Findings

01 Achieved 26.1x speedup with TinyLlama-42M on 8 MCUs

02 Reduced energy consumption to 0.64 mJ per inference

03 Improved Energy Delay Product by 27.2x
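
As a rough cross-check of how the findings relate to one another, the sketch below assumes the conventional definition of Energy Delay Product (energy multiplied by delay) and assumes that the 26.1x speedup and the 27.2x EDP improvement are measured against the same baseline. Neither assumption is stated on this page, and the helper names are hypothetical. Under those assumptions, the EDP gain divided by the speedup gives the implied energy improvement factor.

```python
def edp(energy_mj, delay_ms):
    """Energy Delay Product under the conventional definition: energy * delay."""
    return energy_mj * delay_ms


def implied_energy_ratio(edp_gain, speedup):
    """If EDP = E * T, then EDP gain = (energy gain) * (speedup)."""
    return edp_gain / speedup


# Figures taken from the findings listed above.
speedup = 26.1
edp_gain = 27.2

# Assumption: both figures refer to the same baseline system.
energy_ratio = implied_energy_ratio(edp_gain, speedup)
print(f"implied energy improvement: {energy_ratio:.3f}x")
```

Read this way, most of the EDP gain would come from the speedup, with per-inference energy staying roughly comparable to the baseline.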

Abstract

Contextual Artificial Intelligence (AI) based on emerging Transformer models is predicted to drive the next technology revolution in interactive wearable devices such as new-generation smart glasses. By coupling numerous sensors with small, low-power Micro-Controller Units (MCUs), these devices will enable on-device intelligence and sensor control. A major bottleneck in this class of systems is the small amount of on-chip memory available in the MCUs. In this paper, we propose a methodology to deploy real-world Transformers on low-power wearable devices with minimal off-chip traffic exploiting a distributed system of MCUs, partitioning inference across multiple devices and enabling execution with stationary on-chip weights. We validate the scheme by deploying the TinyLlama-42M decoder-only model on a system of 8 parallel ultra-low-power MCUs. The distributed system achieves an energy…
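
To illustrate the idea of partitioning inference so that each device keeps its own weights on chip, here is a minimal sketch of one possible scheme: a matrix-vector product split by output rows across 8 simulated devices. The row-wise split, the function names, and the toy dimensions are assumptions made for illustration; the abstract does not specify the partitioning used in the paper.

```python
def split_rows(weight, n_devices):
    """Split a weight matrix (list of rows) into contiguous row shards."""
    base, extra = divmod(len(weight), n_devices)
    shards, start = [], 0
    for d in range(n_devices):
        size = base + (1 if d < extra else 0)
        shards.append(weight[start:start + size])
        start += size
    return shards


def matvec(weight, x):
    """Plain matrix-vector product on a single device."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def distributed_matvec(shards, x):
    """Each device multiplies its resident shard by the shared input x.

    Only the input vector and the partial outputs move between devices;
    the weight shards stay where they are (the 'stationary weights' idea).
    """
    out = []
    for shard in shards:  # one loop iteration stands in for one MCU
        out.extend(matvec(shard, x))
    return out


# Toy 16x4 weight matrix and input vector (illustrative values only).
weight = [[(r * 3 + c) % 7 - 3 for c in range(4)] for r in range(16)]
x = [1, -2, 3, 0.5]

shards = split_rows(weight, 8)
reference = matvec(weight, x)
result = distributed_matvec(shards, x)
print(result == reference)
```

In this toy setup each device holds one eighth of the weight rows, so the per-device memory footprint shrinks with the number of devices while the traffic between devices is limited to activations.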

Taxonomy

Topics: Interconnection Networks and Systems · Low-power high-performance VLSI design · Advanced Memory and Neural Computing