MobileQuant: Mobile-friendly Quantization for On-device Language Models
Fuwen Tan, Royson Lee, Łukasz Dudziak, Shell Xu Hu, Sourav Bhattacharya, Timothy Hospedales, Georgios Tzimiropoulos, Brais Martinez

TL;DR
MobileQuant introduces a simple, effective post-training quantization method that enables 8-bit activation quantization for on-device large language models, significantly reducing latency and energy consumption while maintaining accuracy.
Contribution
MobileQuant makes a first attempt to facilitate 8-bit activation quantization for LLMs on edge devices by jointly optimizing the weight transformation and the activation range parameters.
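The two ingredients named above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a per-channel scaling as the weight transformation and a fixed min/max activation range, whereas the paper optimizes both jointly. The names fake_quant, transformed_linear, and the scale vector s are hypothetical and chosen only for illustration.

```python
import numpy as np

def fake_quant(x, lo, hi, bits=8):
    """Uniform quantize-dequantize of x onto the range [lo, hi] with 2**bits levels."""
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    q = np.clip(np.round((x - lo) / scale), 0, levels)
    return q * scale + lo

def transformed_linear(x, w, s):
    """Compute x @ w as (x / s) @ (s[:, None] * w).

    Dividing each activation channel by s and multiplying the matching
    weight row by s leaves the product unchanged in exact arithmetic,
    but can shrink the range the activations occupy.
    """
    return (x / s) @ (s[:, None] * w)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
x[:, 3] *= 50.0                      # one outlier channel (illustrative)
w = rng.normal(size=(8, 5))
s = np.sqrt(np.abs(x).max(axis=0))   # hypothetical hand-picked per-channel scale

# The transformation alone does not change the layer output.
y_ref = x @ w
y_eq = transformed_linear(x, w, s)

# 8-bit activations with a min/max range, before and after the transformation.
x_q = fake_quant(x, x.min(), x.max())
xs = x / s
xs_q = fake_quant(xs, xs.min(), xs.max())
err_plain = np.abs(x_q @ w - y_ref).mean()
err_transformed = np.abs(xs_q @ (s[:, None] * w) - y_ref).mean()
print(err_plain, err_transformed)
```

Comparing the two printed errors shows how the choice of scale and range affects the quantized output. In this sketch both are hand-picked, which is the part a learned, joint optimization would replace.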
Findings
Achieves near-lossless quantization across various LLM benchmarks.
Reduces latency and energy consumption by 20-50% compared to current on-device quantization strategies.
Requires limited compute and is compatible with NPUs.
Abstract
Large language models (LLMs) have revolutionized language processing, delivering outstanding results across multiple applications. However, deploying LLMs on edge devices poses several challenges with respect to memory, energy, and compute costs, limiting their widespread use in devices such as mobile phones. A promising solution is to reduce the number of bits used to represent weights and activations. While existing works have found partial success at quantizing LLMs to lower bitwidths, e.g. 4-bit weights, quantizing activations below 16 bits often leads to large computational overheads due to poor on-device quantization support, or a considerable accuracy drop. Yet, 8-bit activations are very attractive for on-device deployment as they would enable LLMs to fully exploit mobile-friendly hardware, e.g. Neural Processing Units (NPUs). In this work, we make a first attempt to facilitate…
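As a rough illustration of why reducing the bit width matters for memory, the following sketch computes the storage needed for the weights alone at different precisions. The model size of one billion parameters is a hypothetical figure, not one taken from the paper, and the helper weight_memory_gb is introduced here only for the example.

```python
def weight_memory_gb(num_params: int, bits: int) -> float:
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9

num_params = 1_000_000_000  # hypothetical model size, for illustration only
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(num_params, bits):.2f} GB")
```

The estimate ignores activations, quantization scales, and runtime buffers, so it is a lower bound on the actual footprint.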
Taxonomy
Topics: Speech and dialogue systems · Recommender Systems and Techniques · Topic Modeling
Methods: Focus
