RUQuant: Towards Refining Uniform Quantization for Large Language Models
Han Liu, Haotian Gao, Changya Li, Feng Zhang, Xiaotong Zhang, Wei Wang, Hong Yu

TL;DR
RUQuant introduces a theoretically grounded, two-stage orthogonal transformation method for uniform activation quantization in large language models, significantly reducing accuracy loss without retraining.
Contribution
RUQuant proposes a novel orthogonal transformation approach, grounded in the Lloyd-Max optimality conditions, to improve uniform quantization of activations in LLMs.
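The summary does not spell out RUQuant's two-stage transformation. However, the general reason an orthogonal transformation needs no retraining is simple. For any orthogonal Q, W x = (W Q^T)(Q x), so the transform can be folded into the weights offline while it reshapes the activations that actually get quantized. The sketch below shows that property with a generic random orthogonal matrix. It is not the paper's transform, and the layer sizes and the single outlier channel are hypothetical.

```python
import numpy as np

def random_orthogonal(n: int, rng: np.random.Generator) -> np.ndarray:
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    a = rng.standard_normal((n, n))
    q, r = np.linalg.qr(a)
    # Flip column signs so the diagonal of r is positive (q stays orthogonal).
    return q * np.sign(np.diag(r))

def peak_to_rms(x: np.ndarray) -> float:
    """Largest magnitude relative to the RMS value; high values hurt uniform quantization."""
    return float(np.abs(x).max() / np.sqrt(np.mean(x ** 2)))

rng = np.random.default_rng(0)
d_in, d_out, tokens = 64, 32, 128

W = rng.standard_normal((d_out, d_in))
X = rng.standard_normal((d_in, tokens))
X[3] *= 50.0  # hypothetical outlier channel

Q = random_orthogonal(d_in, rng)
X_rot = Q @ X    # transformed activations (these would be quantized)
W_rot = W @ Q.T  # inverse transform folded into the weights offline

# In full precision the layer output is unchanged, since Q.T @ Q = I.
print(np.allclose(W_rot @ X_rot, W @ X))
print(f"peak/RMS before: {peak_to_rms(X):.1f}, after: {peak_to_rms(X_rot):.1f}")
```

A random QR-based rotation is used here only because it is easy to generate. Practical rotation-based quantizers often prefer structured matrices such as Hadamard transforms because they can be applied faster.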
Findings
Achieves 99.8% of full-precision accuracy with W6A6 quantization.
Attains 97% of full-precision accuracy with W4A4 quantization for a 13B LLM.
Runs in approximately one minute, without fine-tuning the model.
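In the WxAy notation used above, weights are quantized to x bits and activations to y bits. The following minimal sketch shows what that means for a single linear layer. It assumes the simplest per-tensor symmetric uniform quantizer with round-to-nearest, which is not necessarily the configuration evaluated in the paper, and uses random Gaussian data.

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, bits: int) -> np.ndarray:
    """Per-tensor symmetric uniform fake quantization (quantize, then dequantize)."""
    qmax = 2 ** (bits - 1) - 1             # 7 levels per sign at 4 bits, 31 at 6 bits
    scale = np.abs(x).max() / qmax         # uniform step size set by the largest value
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

def linear_wa(W: np.ndarray, X: np.ndarray, w_bits: int, a_bits: int) -> np.ndarray:
    """Simulate a WxAy linear layer: quantize weights and activations, then multiply."""
    return quantize_symmetric(W, w_bits) @ quantize_symmetric(X, a_bits)

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 64))
X = rng.standard_normal((64, 128))
Y = W @ X  # full-precision reference output

errors = {}
for bits in (6, 4):
    Y_q = linear_wa(W, X, bits, bits)
    errors[bits] = float(np.linalg.norm(Y_q - Y) / np.linalg.norm(Y))
    print(f"W{bits}A{bits}: relative output error = {errors[bits]:.4f}")
```

Going from 6 to 4 bits makes the step size roughly four times coarser, which is why W4A4 is much harder to keep near full-precision accuracy than W6A6.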
Abstract
The increasing size and complexity of large language models (LLMs) have raised significant challenges in deployment efficiency, particularly under resource constraints. Post-training quantization (PTQ) has emerged as a practical solution by compressing models without requiring retraining. While existing methods focus on uniform quantization schemes for both weights and activations, they often suffer from substantial accuracy degradation due to the non-uniform nature of activation distributions. In this work, we revisit the activation quantization problem from a theoretical perspective grounded in the Lloyd-Max optimality conditions. We identify the core issue as the non-uniform distribution of activations within the quantization interval, which causes the optimal quantization point under the Lloyd-Max criterion to shift away from the midpoint of the interval. To address this issue, we…
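The Lloyd-Max argument in the abstract can be checked numerically. Under the Lloyd-Max conditions, the MSE-optimal reconstruction point of a quantization cell [a, b) is the conditional mean E[X | a <= X < b]. A uniform quantizer instead reconstructs to the midpoint (a + b)/2. The sketch below assumes a standard normal distribution as a stand-in for bell-shaped activations and uses a single hypothetical cell [1, 2). It illustrates the problem RUQuant targets, not its solution.

```python
import math
import numpy as np

def std_normal_pdf(t: float) -> float:
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(t: float) -> float:
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)  # assumed stand-in for activation values

a, b = 1.0, 2.0                      # one hypothetical quantization cell [a, b)
cell = x[(x >= a) & (x < b)]

midpoint = (a + b) / 2.0             # reconstruction point of a uniform quantizer
centroid = float(cell.mean())        # Lloyd-Max optimal point: E[X | a <= X < b]

# Closed-form conditional mean of a standard normal truncated to [a, b).
centroid_exact = (std_normal_pdf(a) - std_normal_pdf(b)) / (
    std_normal_cdf(b) - std_normal_cdf(a)
)

# Within-cell mean squared error of each reconstruction point.
mse_mid = float(np.mean((cell - midpoint) ** 2))
mse_cen = float(np.mean((cell - centroid) ** 2))

print(f"midpoint={midpoint:.3f} centroid={centroid:.3f} (exact {centroid_exact:.3f})")
print(f"MSE at midpoint={mse_mid:.4f}, MSE at centroid={mse_cen:.4f}")
```

The density falls off across the cell, so the centroid (about 1.38) sits below the midpoint 1.5. The extra within-cell MSE paid by the midpoint is exactly (midpoint - centroid)^2, roughly 0.014 here.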
