OTARo: Once Tuning for All Precisions toward Robust On-Device LLMs
Shaoyuan Chen, Zhixuan Chen, Dawei Yang, Zhihang Yuan, Qiang Wu

TL;DR
OTARo combines a shared-exponent floating-point quantization format with a fine-tuning process, so that an on-device LLM tuned once can switch between precisions seamlessly while remaining robust.
Contribution
The paper proposes OTARo, an approach that allows flexible precision switching in on-device LLMs through shared-exponent quantization and a robust training process.
Findings
OTARo achieves strong performance across multiple precisions.
The method maintains robustness in diverse downstream tasks.
Experiments on LLaMA models validate effectiveness.
Abstract
Fine-tuning techniques for Large Language Models (LLMs) not only improve adaptability to diverse downstream tasks but also mitigate the adverse effects of model quantization. Despite this, conventional quantization suffers from a structural limitation that hinders flexibility during the fine-tuning and deployment stages. Practical on-device tasks demand different quantization precisions (i.e., different bit-widths); for example, understanding tasks tend to tolerate reduced precision better than generation tasks. Conventional quantization, which typically relies on scaling factors that are incompatible across bit-widths, cannot support on-device switching of precisions when confronted with complex real-world scenarios. To overcome this dilemma, we propose OTARo, a novel method that enables on-device LLMs to flexibly switch quantization precisions while maintaining performance…
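The abstract's point about scaling factors can be made concrete with a small sketch. The code below is not the paper's format or training procedure; it is a generic block quantizer with one power-of-two exponent per block, and the function names, block contents, and bit-widths are all assumptions chosen for illustration. It shows why a shared exponent helps: a lower-precision copy can be obtained by dropping low-order mantissa bits, without recomputing any per-bit-width scaling factor.

```python
import math

def quantize_block(weights, bits):
    """Quantize a block to signed `bits`-bit mantissas with one shared exponent.

    Hypothetical helper for illustration; not the OTARo implementation.
    """
    max_abs = max(abs(w) for w in weights)
    # Shared exponent: smallest power of two strictly above the largest magnitude.
    exp = math.floor(math.log2(max_abs)) + 1 if max_abs > 0 else 0
    # One mantissa step is worth 2**(exp - (bits - 1)).
    step = 2.0 ** (exp - (bits - 1))
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    mantissas = [min(hi, max(lo, round(w / step))) for w in weights]
    return exp, mantissas

def dequantize_block(exp, mantissas, bits):
    """Map mantissas back to real values using the shared exponent."""
    step = 2.0 ** (exp - (bits - 1))
    return [m * step for m in mantissas]

def truncate_mantissas(mantissas, from_bits, to_bits):
    """Switch to a lower precision by dropping low-order mantissa bits."""
    shift = from_bits - to_bits
    # Python's >> on ints is an arithmetic shift (rounds toward -infinity).
    return [m >> shift for m in mantissas]

# Assumed example block; the exponent is computed once and reused at both widths.
block = [0.75, -0.5, 0.125, 0.3]
exp, m8 = quantize_block(block, bits=8)
m4 = truncate_mantissas(m8, from_bits=8, to_bits=4)
print(exp, m8, m4)
print(dequantize_block(exp, m4, bits=4))
```

With a conventional scheme that uses a separate real-valued scale per bit-width, the 4-bit weights could not be derived from the 8-bit ones by a shift alone. The truncation here rounds toward negative infinity, which is a simplification; a practical scheme may choose a different rounding rule.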
Taxonomy
Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Topic Modeling
