DartQuant: Efficient Rotational Distribution Calibration for LLM Quantization
Yuantian Shao, Yuanteng Chen, Peisong Wang, Jianlin Yu, Jing Lin, Yiwu Yao, Zhihui Wei, Jian Cheng

TL;DR
DartQuant is an efficient rotational calibration method for large language model quantization. It significantly reduces the computational cost of rotation optimization, making high-quality model compression feasible in resource-constrained environments.
Contribution
The paper proposes a distribution-aware rotational calibration technique together with the QR-Orth optimization scheme, reducing the complexity and resource requirements of quantizing large models.
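This summary does not describe QR-Orth in detail. One plausible reading, offered only as a sketch, is that the rotation is obtained as the Q factor of a QR decomposition of an unconstrained matrix, so any ordinary optimizer can update that matrix while the rotation stays orthogonal. The function name `orthogonal_from_latent`, the matrix size, and the random perturbation standing in for a gradient step are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def orthogonal_from_latent(z):
    """Map an unconstrained square matrix to an orthogonal one via QR.

    Hypothetical helper: illustrates a QR-based parameterization,
    not the paper's actual QR-Orth code.
    """
    q, r = np.linalg.qr(z)
    # Flip column signs so the diagonal of R is positive; this makes
    # the mapping from z to q unique for a full-rank z.
    signs = np.sign(np.diag(r))
    signs[signs == 0] = 1.0
    return q * signs

rng = np.random.default_rng(0)
Z = rng.standard_normal((8, 8))           # unconstrained latent parameters
R = orthogonal_from_latent(Z)
orth_error = np.abs(R.T @ R - np.eye(8)).max()

# Stand-in for one optimizer step on Z (a random perturbation here,
# since no loss is defined in this sketch).
Z_next = Z - 0.1 * rng.standard_normal((8, 8))
R_next = orthogonal_from_latent(Z_next)
orth_error_next = np.abs(R_next.T @ R_next - np.eye(8)).max()
```

Under this parameterization the orthogonality constraint holds by construction after every update, so no separate projection or alternating step is needed to restore it.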
Findings
Achieves 47× acceleration in rotational optimization.
Reduces memory use by 10× compared to existing methods.
Enables quantization of 70B models on a single GPU.
Abstract
Quantization plays a crucial role in accelerating the inference of large-scale models, and rotational matrices have been shown to effectively improve quantization performance by smoothing outliers. However, end-to-end fine-tuning of rotational optimization algorithms incurs high computational costs and is prone to overfitting. To address this challenge, we propose an efficient distribution-aware rotational calibration method, DartQuant, which reduces the complexity of rotational optimization by constraining the distribution of the activations after rotation. This approach also effectively reduces reliance on task-specific losses, thereby mitigating the risk of overfitting. Additionally, we introduce the QR-Orth optimization scheme, which replaces expensive alternating optimization with a more efficient solution. In a variety of model quantization experiments, DartQuant demonstrates…
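The claim that rotations smooth outliers without changing the model's output can be checked on a toy example. The sketch below uses a fixed normalized Hadamard matrix as a stand-in orthogonal transform (an assumption; DartQuant calibrates its rotations rather than fixing them), and the shapes, the outlier channel, and the helper names `hadamard` and `peak_to_rms` are hypothetical.

```python
import numpy as np

def hadamard(n):
    """Normalized Hadamard matrix (Sylvester construction); n must be a power of two."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def peak_to_rms(x):
    """Largest absolute entry divided by the root-mean-square of all entries."""
    return np.abs(x).max() / np.sqrt(np.mean(x ** 2))

rng = np.random.default_rng(0)
n = 64
X = rng.standard_normal((128, n))   # toy activations: 128 tokens, 64 channels
X[:, 3] *= 50.0                     # inject one outlier channel
W = rng.standard_normal((n, 32))    # toy weight matrix

R = hadamard(n)                     # orthogonal: R @ R.T == I
X_rot = X @ R                       # rotated activations
W_rot = R.T @ W                     # counter-rotated weights

# The layer output is unchanged because (X R)(R^T W) = X W.
same_output = np.allclose(X_rot @ W_rot, X @ W)

ratio_before = peak_to_rms(X)
ratio_after = peak_to_rms(X_rot)
print(f"output preserved: {same_output}")
print(f"peak/RMS before rotation: {ratio_before:.2f}, after: {ratio_after:.2f}")
```

A lower peak-to-RMS ratio means the values fill a quantization grid more evenly, which is why spreading an outlier channel across all channels tends to help low-bit quantization.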
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Natural Language Processing Techniques
