MCUBERT: Memory-Efficient BERT Inference on Commodity Microcontrollers
Zebin Yang, Renze Chen, Taiqiang Wu, Ngai Wong, Yun Liang, Runsheng Wang, Ru Huang, Meng Li

TL;DR
MCUBERT is a memory-efficient approach to running BERT-based language models on microcontrollers. It combines embedding compression at the network level with a fine-grained scheduling strategy, supporting much longer input sequences without a latency or accuracy penalty.
Contribution
This work presents the first method to enable lightweight BERT models on commodity microcontrollers through network and scheduling co-optimization techniques.
Findings
Reduces the parameter size of BERT-tiny and BERT-mini by 5.7× and 3.0×, respectively.
Supports processing over 512 tokens on MCUs with less than 256KB memory.
Achieves 1.5 times latency reduction compared to baseline.
Abstract
In this paper, we propose MCUBERT to enable language models like BERT on tiny microcontroller units (MCUs) through network and scheduling co-optimization. We observe that the embedding table is the major storage bottleneck for tiny BERT models. Hence, at the network level, we propose an MCU-aware two-stage neural architecture search algorithm based on clustered low-rank approximation for embedding compression. To reduce the inference memory requirements, we further propose a novel fine-grained MCU-friendly scheduling strategy. Through careful computation tiling and re-ordering as well as kernel design, we drastically increase the input sequence lengths supported on MCUs without any latency or accuracy penalty. MCUBERT reduces the parameter size of BERT-tiny and BERT-mini by 5.7× and 3.0× and the execution memory by 3.5× and 4.3×, respectively. MCUBERT…
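The idea of a clustered low-rank approximation of an embedding table can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how rows are clustered or how ranks are chosen (the paper uses a neural architecture search for that), so the cluster assignment, the per-cluster ranks, the table size, and all function names below are hypothetical. The sketch simply factors each cluster of rows with a truncated SVD and counts the stored parameters.

```python
import numpy as np

def compress_embedding(table, cluster_ids, ranks):
    """Factor each cluster of rows into a low-rank pair via truncated SVD.

    table:       (vocab, dim) embedding matrix
    cluster_ids: (vocab,) cluster index per row (assumed given)
    ranks:       dict mapping cluster index -> rank kept (assumed given)
    """
    local_index = np.zeros(table.shape[0], dtype=np.int64)
    factors = {}
    for c, r in ranks.items():
        rows = np.flatnonzero(cluster_ids == c)
        local_index[rows] = np.arange(rows.size)  # row position inside its cluster
        u, s, vt = np.linalg.svd(table[rows], full_matrices=False)
        # Keep the top-r singular directions: left is (n_rows, r), right is (r, dim).
        factors[c] = (u[:, :r] * s[:r], vt[:r])
    return factors, local_index

def lookup(factors, cluster_ids, local_index, token_id):
    """Reconstruct one embedding row from its cluster's factors."""
    left, right = factors[int(cluster_ids[token_id])]
    return left[local_index[token_id]] @ right

def num_params(factors):
    """Total number of stored values across all factor pairs."""
    return sum(left.size + right.size for left, right in factors.values())

# Hypothetical setup: 1000 tokens, 128-dim embeddings, two clusters.
rng = np.random.default_rng(0)
table = rng.standard_normal((1000, 128)).astype(np.float32)
cluster_ids = np.repeat([0, 1], [100, 900])  # e.g. a small cluster kept at higher rank
ranks = {0: 64, 1: 16}

factors, local_index = compress_embedding(table, cluster_ids, ranks)
print(table.size, num_params(factors))  # 128000 vs. 31040 stored values
```

With these made-up settings, storage drops from 128000 to 31040 values, roughly 4.1×. The accuracy cost of such a compression depends on the real embedding spectrum and cannot be inferred from this random example.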
Taxonomy
Methods: Attention Is All You Need · Dropout · Dense Connections · Layer Normalization · Residual Connection · Linear Warmup With Linear Decay · Weight Decay · Adam · Attention Dropout
