Quantized Transformer Language Model Implementations on Edge Devices
Mohammad Wali Ur Rahman, Murad Mehrab Abrar, Hunter Gibbons Copening, Salim Hariri, Sicong Shao, Pratik Satam, and Soheil Salehi

TL;DR
This paper demonstrates that quantized MobileBERT models can be deployed efficiently on edge devices, significantly reducing model size and latency while maintaining acceptable accuracy for NLP tasks and preserving privacy by keeping data on the device.
Contribution
The study introduces a method for converting and quantizing large transformer models into a FlatBuffer format optimized for edge deployment, with comprehensive performance evaluation.
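As a rough illustration of what quantizing a model's weights involves, the following minimal sketch applies affine 8-bit quantization to a handful of floating-point values in plain Python. The quantization scheme the authors actually used is not stated in this summary, so the scheme, the function names (`quantize_int8`, `dequantize`), and the sample weights are all assumptions made for illustration. The sketch also does not produce a FlatBuffer file; it only shows why storing 8-bit integers instead of 32-bit floats shrinks a model while introducing a small, bounded rounding error.

```python
import struct

def quantize_int8(weights):
    """Affine 8-bit quantization of a list of floats (illustrative scheme, not the paper's)."""
    # Extend the range to include zero so that 0.0 maps exactly to an integer.
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    # One quantization step in float units; fall back to 1.0 if all weights are zero.
    scale = (w_max - w_min) / 255.0 or 1.0
    # Integer that represents the real value 0.0.
    zero_point = round(-128 - w_min / scale)
    # Map each weight to an integer and clamp it to the signed 8-bit range.
    q = [max(-128, min(127, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float values from the 8-bit integers."""
    return [(v - zero_point) * scale for v in q]

# Hypothetical weights, chosen only for this example.
weights = [-0.42, 0.0, 0.13, 0.37, 0.91]
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)

# Storage cost: 4 bytes per float32 value versus 1 byte per int8 value.
float_bytes = len(struct.pack("<%df" % len(weights), *weights))
int8_bytes = len(struct.pack("<%db" % len(q), *q))
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("quantized:", q)
print("bytes: float32 =", float_bytes, "int8 =", int8_bytes)
print("max round-trip error:", max_error, "step size:", scale)
```

Weight quantization of this kind gives a 4× storage reduction on its own. The much larger size ratio reported relative to BERT large therefore also reflects MobileBERT's smaller architecture, not quantization alone.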
Findings
MobileBERT models are 160× smaller than BERT large, with a 4.1% drop in accuracy.
Edge deployment achieves at least one tweet analysis per second.
Models maintain privacy by processing data locally in TinyML systems.
Abstract
Large-scale transformer-based models like the Bidirectional Encoder Representations from Transformers (BERT) are widely used for Natural Language Processing (NLP) applications. These models, which have millions of parameters, are first pre-trained on a large corpus and then fine-tuned for a downstream NLP task. One of the major limitations of these large-scale models is that they cannot be deployed on resource-constrained devices due to their large model size and increased inference latency. To overcome these limitations, such large-scale models can be converted to an optimized FlatBuffer format, tailored for deployment on resource-constrained edge devices. Herein, we evaluate the performance of such FlatBuffer-transformed MobileBERT models on three different edge devices, fine-tuned for reputation analysis of English-language tweets in the RepLab 2013 dataset. In…
Taxonomy
Topics: Neural Networks and Applications · Power Systems and Technologies
Methods: Attention Is All You Need · Softmax · Dropout · WordPiece · Attention Dropout · Dense Connections · Adam · Residual Connection · Layer Normalization · Weight Decay
