QuEPT: Quantized Elastic Precision Transformers with One-Shot Calibration for Multi-Bit Switching
Ke Xu, Yixin Wang, Zhongcheng Li, Hao Cui, Jinshui Hu, Xingyi Zhang

TL;DR
QuEPT introduces a novel post-training quantization method for Transformers that enables efficient multi-bit deployment with one-shot calibration, dynamic bit-width adaptation, and improved robustness through token merging and low-rank adapters.
Contribution
The paper presents QuEPT, a new elastic quantization scheme for Transformers that supports real-time switching between quantization precisions with minimal calibration, enhancing efficiency and robustness.
Findings
Achieves performance comparable or superior to state-of-the-art methods.
Supports real-time switching between uniform and mixed precision quantization.
Demonstrates effectiveness on large language models.
Abstract
Elastic precision quantization enables multi-bit deployment via a single optimization pass, fitting diverse quantization scenarios. Yet, owing to the high storage and optimization costs associated with the Transformer architecture, research on elastic quantization remains limited, particularly for large language models. This paper proposes QuEPT, an efficient post-training scheme that reconstructs block-wise multi-bit errors with one-shot calibration on a small data slice. It can dynamically adapt to various predefined bit-widths by cascading different low-rank adapters, and supports real-time switching between uniform quantization and mixed precision quantization without repeated optimization. To enhance accuracy and robustness, we introduce Multi-Bit Token Merging (MB-ToMe) to dynamically fuse token features across different bit-widths, improving robustness during bit-width switching.…
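The abstract does not say how the low-rank adapters are fitted or cascaded, so the following is only a rough sketch of the general idea, not QuEPT's actual procedure. It assumes symmetric per-tensor uniform quantization, fits each adapter by truncated SVD of the remaining weight error (a swapped-in technique, not block-wise calibration on data), and assumes that a given bit-width uses the sum of all adapters up to and including its own. The names `fake_quantize`, `low_rank_residual`, `build_cascade`, and `effective_weight` are hypothetical.

```python
import numpy as np


def fake_quantize(w, bits):
    """Symmetric per-tensor uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(w).max(), 1e-12) / qmax  # guard against all-zero weights
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale


def low_rank_residual(residual, rank):
    """Best rank-`rank` approximation of `residual`, returned as two factors."""
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # shape (out_features, rank)
    b = vt[:rank]               # shape (rank, in_features)
    return a, b


def build_cascade(w, bit_widths, rank):
    """Fit one adapter per bit-width, from the highest precision downwards.

    Assumption for this sketch: each adapter is fitted to the error that
    remains after the earlier adapters in the cascade have been applied.
    """
    adapters = []
    correction = np.zeros_like(w)
    for bits in sorted(bit_widths, reverse=True):
        residual = w - (fake_quantize(w, bits) + correction)
        a, b = low_rank_residual(residual, rank)
        adapters.append((bits, a, b))
        correction = correction + a @ b
    return adapters


def effective_weight(w, adapters, bits):
    """Switch to `bits` by summing the cascade up to that bit-width."""
    correction = np.zeros_like(w)
    for adapter_bits, a, b in adapters:
        correction = correction + a @ b
        if adapter_bits == bits:
            return fake_quantize(w, bits) + correction
    raise ValueError(f"no adapter was fitted for {bits} bits")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((64, 64))
    adapters = build_cascade(w, bit_widths=[4, 6, 8], rank=8)
    for bits in (8, 6, 4):
        plain = np.linalg.norm(w - fake_quantize(w, bits))
        adapted = np.linalg.norm(w - effective_weight(w, adapters, bits))
        print(f"{bits}-bit: error {plain:.4f} without adapters, {adapted:.4f} with")
```

In this toy setup, switching precision amounts to choosing how far along the cascade to sum, with no refitting at switch time. That mirrors the abstract's claim of switching without repeated optimization, although the real method calibrates on activations rather than on the weights alone.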
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Advanced Data Compression Techniques
