MCUBERT: Memory-Efficient BERT Inference on Commodity Microcontrollers

Zebin Yang; Renze Chen; Taiqiang Wu; Ngai Wong; Yun Liang; Runsheng Wang; Ru Huang; Meng Li

arXiv:2410.17957·cs.LG·October 24, 2024

TL;DR

MCUBERT runs BERT-based language models on microcontrollers by combining embedding compression at the network level with a fine-grained, MCU-friendly scheduling strategy, supporting longer input sequences without a latency or accuracy penalty.

Contribution

This work presents the first method to enable lightweight BERT models on commodity microcontrollers through network and scheduling co-optimization techniques.

Findings

01

Reduces the parameter size of BERT-tiny and BERT-mini by 5.7× and 3.0×, respectively.

02

Supports processing over 512 tokens on MCUs with less than 256KB memory.

03

Achieves a 1.5× latency reduction compared to the baseline.

Abstract

In this paper, we propose MCUBERT to enable language models like BERT on tiny microcontroller units (MCUs) through network and scheduling co-optimization. We observe the embedding table contributes to the major storage bottleneck for tiny BERT models. Hence, at the network level, we propose an MCU-aware two-stage neural architecture search algorithm based on clustered low-rank approximation for embedding compression. To reduce the inference memory requirements, we further propose a novel fine-grained MCU-friendly scheduling strategy. Through careful computation tiling and re-ordering as well as kernel design, we drastically increase the input sequence lengths supported on MCUs without any latency or accuracy penalty. MCUBERT reduces the parameter size of BERT-tiny and BERT-mini by 5.7$\times$ and 3.0$\times$ and the execution memory by 3.5$\times$ and 4.3$\times$, respectively. MCUBERT…
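
To make the idea of clustered low-rank approximation concrete, the following is a minimal sketch rather than the authors' algorithm. It assumes the rows of an embedding table have already been assigned to clusters and that each cluster has been given a rank; in the paper, such choices come from the architecture search, which is not reproduced here. The function names (`clustered_low_rank`, `reconstruct`, `param_count`), the toy table size, the cluster split, and the ranks are all hypothetical and chosen only for illustration.

```python
import numpy as np

def clustered_low_rank(table, cluster_ids, ranks):
    """Factor each cluster of embedding rows with a truncated SVD.

    table:       (vocab, dim) embedding matrix
    cluster_ids: (vocab,) cluster index of every row (assumed to be given)
    ranks:       rank kept for each cluster (assumed hyperparameters)
    """
    factors = []
    for c, r in enumerate(ranks):
        rows = np.flatnonzero(cluster_ids == c)
        u, s, vt = np.linalg.svd(table[rows], full_matrices=False)
        # Store a (rows x r) factor and an (r x dim) factor per cluster.
        factors.append((rows, u[:, :r] * s[:r], vt[:r]))
    return factors

def reconstruct(factors, shape):
    """Rebuild an approximate embedding table from the per-cluster factors."""
    out = np.zeros(shape)
    for rows, a, b in factors:
        out[rows] = a @ b
    return out

def param_count(factors):
    """Number of stored values across all cluster factors."""
    return sum(a.size + b.size for _, a, b in factors)

# Toy example: 1000 tokens, 64 dimensions, two clusters with different ranks.
rng = np.random.default_rng(0)
table = rng.standard_normal((1000, 64))
cluster_ids = np.where(np.arange(1000) < 200, 0, 1)  # hypothetical split
factors = clustered_low_rank(table, cluster_ids, ranks=[32, 4])

print("dense parameters:   ", table.size)
print("factored parameters:", param_count(factors))
```

Giving one cluster a higher rank than the other lets the approximation spend more parameters where fidelity matters most. Which rows deserve the higher rank is a design decision this sketch leaves open.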

Taxonomy

Methods: Attention Is All You Need · Dropout · Dense Connections · Layer Normalization · Residual Connection · Linear Warmup With Linear Decay · Weight Decay · Adam · Attention Dropout