Outlier Smoothing with Closed-Form Rotations for W4A4 Large Language Model Quantization
Jinying Xiao, Bin Ji, Shasha Li, Xiaodong Liu, Ma Jun, Chao Wang, Wei Li, Ye Zhong, Xuan Xie, Nyima Tashi, Jie Yu

TL;DR
This paper introduces SingleQuant, a novel quantization framework for large language models that uses closed-form rotations to improve efficiency and performance, addressing convergence issues in existing methods.
Contribution
SingleQuant decouples quantization from truncation using geometric rotations, enabling faster and more accurate LLM quantization with theoretical and empirical validation.
Findings
SingleQuant achieves a 1,400× speedup when quantizing LLaMA-2-13B.
It improves average task performance by 0.57% over baselines.
Experimental results on 7B-70B LLMs demonstrate superior performance and efficiency.
Abstract
Large language model (LLM) quantization facilitates deploying LLMs in resource-limited settings, but existing methods that combine incompatible gradient optimization and quantization truncation suffer from serious convergence pathology. This prolongs quantization time and degrades the LLMs' task performance. Our studies confirm that the Straight-Through Estimator (STE) on Stiefel manifolds introduces non-smoothness and gradient noise, obstructing optimization convergence and blocking high-fidelity quantized LLM development despite extensive training. To address these limitations, we propose SingleQuant, a single-pass quantization framework that decouples from quantization truncation, thereby eliminating the non-smoothness and gradient noise described above. Specifically, SingleQuant constructs Alignment Rotation Transformation (ART) and Uniformity Rotation Transformation (URT) targeting distinct…
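As a rough illustration of why rotations help with activation outliers in 4-bit quantization, the sketch below applies a fixed orthonormal Hadamard rotation to an activation vector that has one massive outlier channel. This is not the paper's ART or URT construction, which the abstract does not spell out; the Hadamard matrix is a stand-in, and the activation values, weight matrix, and helper names are all hypothetical. It shows two generic properties of orthogonal rotations: the layer output is unchanged when the weights absorb the inverse rotation, and the outlier's energy is spread across channels so that fewer channels collapse to zero under symmetric 4-bit rounding.

```python
import math

def hadamard(n):
    """Orthonormal n x n Hadamard matrix via the Sylvester construction (n a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]

def matvec(x, m):
    """Row vector x times matrix m."""
    return [sum(x[i] * m[i][j] for i in range(len(x))) for j in range(len(m[0]))]

def matmul(a, b):
    """Matrix product a @ b for lists of rows."""
    return [matvec(row, b) for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def int4_levels(x):
    """Symmetric per-tensor 4-bit round-to-nearest: integer levels clipped to [-8, 7]."""
    scale = max(abs(v) for v in x) / 7.0
    return [max(-8, min(7, round(v / scale))) for v in x]

# Hypothetical activation with one massive outlier channel, and toy 8x4 weights.
x = [0.5, -0.3, 0.8, 40.0, -0.6, 0.2, 0.4, -0.7]
w = [[((i * 3 + j * 5) % 7 - 3) / 10.0 for j in range(4)] for i in range(8)]

h = hadamard(8)
x_rot = matvec(x, h)              # rotated activation: x H
w_rot = matmul(transpose(h), w)   # weights absorb the inverse rotation: H^T W

# Because H H^T = I, the full-precision layer output is unchanged.
y = matvec(x, w)
y_rot = matvec(x_rot, w_rot)

# Count channels that round to the zero level under 4-bit quantization.
zeros_plain = sum(1 for q in int4_levels(x) if q == 0)
zeros_rot = sum(1 for q in int4_levels(x_rot) if q == 0)

print("max |x|     :", max(abs(v) for v in x))
print("max |x H|   :", max(abs(v) for v in x_rot))
print("zeroed channels (plain, rotated):", zeros_plain, zeros_rot)
```

In this toy case the outlier forces a large quantization step, so every non-outlier channel of the unrotated activation rounds to zero, while the rotated activation has a much smaller peak and no zeroed channels. Whether end-to-end error improves depends on the data and on how the rotation is chosen, which is the question the paper's closed-form rotations address.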
