TL;DR
This paper introduces a split inference system for CNNs on networked microcontrollers: multiple devices execute one model collaboratively, which overcomes per-device memory constraints while maintaining inference latency.
Contribution
It proposes a novel sub-layer splitting approach for CNN inference on MCUs, distributing model parameters and activations across devices to reduce peak RAM usage.
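To make the idea of sub-layer splitting concrete, here is a minimal pure-Python sketch of kernel-wise partitioning for a single convolution layer. It is an illustration under stated assumptions, not the paper's implementation: the function names (`conv2d`, `split_kernels`, `split_conv2d`) are hypothetical, "devices" are simulated by a loop, and every simulated device reads the full input feature map, whereas the paper also distributes activations. The point it shows is that each output channel depends only on its own kernel, so the kernels can be divided across devices and the concatenated partial outputs match the unsplit layer.

```python
def conv2d(x, kernels):
    """Valid 2D cross-correlation, stride 1.

    x:       [C_in][H][W]
    kernels: [C_out][C_in][K][K]
    returns: [C_out][H-K+1][W-K+1]
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    k = len(kernels[0][0])
    out = []
    for kern in kernels:  # one output channel per kernel
        fmap = [[sum(kern[c][i][j] * x[c][r + i][q + j]
                     for c in range(c_in)
                     for i in range(k)
                     for j in range(k))
                 for q in range(w - k + 1)]
                for r in range(h - k + 1)]
        out.append(fmap)
    return out


def split_kernels(kernels, n_devices):
    """Divide the output kernels into n_devices contiguous slices."""
    base, extra = divmod(len(kernels), n_devices)
    parts, start = [], 0
    for d in range(n_devices):
        size = base + (1 if d < extra else 0)
        parts.append(kernels[start:start + size])
        start += size
    return parts


def split_conv2d(x, kernels, n_devices):
    """Simulate kernel-wise split inference: each (hypothetical) device
    stores only its slice of the kernels and computes only the matching
    output channels; the results are concatenated in device order."""
    out = []
    for part in split_kernels(kernels, n_devices):
        if part:  # a device may receive no kernels if n_devices > C_out
            out.extend(conv2d(x, part))
    return out


# Small integer example so the comparison is exact.
x = [[[(c + r * 4 + q) % 5 for q in range(4)] for r in range(4)]
     for c in range(2)]
kernels = [[[[(o + c + i + j) % 3 - 1 for j in range(3)] for i in range(3)]
            for c in range(2)]
           for o in range(6)]

full = conv2d(x, kernels)
split = split_conv2d(x, kernels, 3)
print(split == full)
```

In this sketch each of the three simulated devices holds two of the six kernels and produces two of the six output channels, so both the stored weights and the produced activations per device shrink with the number of devices.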
Findings
Enables multi-MCU inference of CNNs that were previously infeasible on a single device.
Reduces peak RAM usage per MCU while maintaining inference latency.
Successfully tested with MobileNetV2 on up to 8 MCUs.
Abstract
Running deep neural networks on microcontroller units (MCUs) is severely constrained by limited memory resources. While TinyML techniques reduce model size and computation, they often fail in practice due to excessive peak Random Access Memory (RAM) usage during inference, which is dominated by intermediate activations. As a result, many models remain infeasible on standalone MCUs. In this work, we present a fine-grained split inference system for networked MCUs that enables collaborative inference of Convolutional Neural Network (CNN) models across multiple devices. Our key insight is that breaking the memory bottleneck requires splitting inference at sub-layer granularity rather than at layer boundaries. We reinterpret pre-trained models to enable kernel-wise and neuron-wise partitioning, and distribute both model parameters and intermediate activations across multiple MCUs. A lightweight,…
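The neuron-wise partitioning mentioned in the abstract can be sketched for a fully connected layer as follows. This is a hedged illustration, not the authors' system: the helper names (`dense`, `partition_neurons`, `split_dense`, `params_per_device`) and the toy weights are assumptions made for the example, and communication between devices is not modeled. It shows that each neuron's output depends only on its own weight row and bias, so rows can be assigned to different devices, and it counts how many parameters each simulated device would need to store.

```python
def dense(x, weights, biases):
    """Fully connected layer: one output per neuron (one row of weights)."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def partition_neurons(weights, biases, n_devices):
    """Assign contiguous groups of neurons to at most n_devices devices."""
    chunk = -(-len(weights) // n_devices)  # ceiling division
    return [(weights[i:i + chunk], biases[i:i + chunk])
            for i in range(0, len(weights), chunk)]


def split_dense(x, weights, biases, n_devices):
    """Simulate neuron-wise split inference and concatenate the partial
    outputs in device order."""
    out = []
    for w_part, b_part in partition_neurons(weights, biases, n_devices):
        out.extend(dense(x, w_part, b_part))
    return out


def params_per_device(weights, biases, n_devices):
    """Number of weights plus biases stored on each simulated device."""
    return [sum(len(row) for row in w_part) + len(b_part)
            for w_part, b_part in partition_neurons(weights, biases, n_devices)]


# Toy layer: 4 inputs, 8 neurons, integer values so the comparison is exact.
x = [1, 2, 3, 4]
weights = [[(n + i) % 3 - 1 for i in range(4)] for n in range(8)]
biases = list(range(8))

full_out = dense(x, weights, biases)
split_out = split_dense(x, weights, biases, 4)
per_device = params_per_device(weights, biases, 4)
print(split_out == full_out, per_device)
```

With four simulated devices, each one stores two of the eight weight rows and their biases instead of the whole layer, which is the kind of per-device reduction the abstract describes for model parameters.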
