TL;DR
CodeQuant is a unified quantization-and-clustering method that reduces outlier-induced errors in low-precision MoE models, yielding up to 4.15x speedup and higher accuracy than existing quantization methods.
Contribution
It proposes a scheme that smooths activation outliers with a learnable rotation and absorbs weight outliers into fine-tuned cluster centroids, improving low-precision deployment of large MoE language models.
Findings
Achieves up to 4.15x speedup on hardware.
Delivers higher accuracy than existing quantization methods.
Effectively reduces quantization errors in MoE models.
Abstract
Outliers have emerged as a fundamental bottleneck in preserving accuracy for low-precision large models, particularly within Mixture-of-Experts (MoE) architectures that are increasingly central to large-scale language modeling. Under post-training quantization (PTQ), these outliers induce substantial quantization errors, leading to severe accuracy degradation. While recent rotation-based smoothing techniques alleviate the problem by redistributing outlier magnitudes, residual errors remain and continue to impede reliable low-precision deployment. In this work, we tackle this challenge by introducing CodeQuant, a unified quantization-and-clustering scheme for MoE that smooths activation outliers via learnable rotation and absorbs weight outliers into fine-tuned cluster centroids. This design reduces the influence of extreme values by fitting them within cluster…
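The two ideas named in the abstract can be illustrated with a minimal NumPy sketch. It is not CodeQuant's implementation: the rotation here is a fixed normalized Hadamard matrix rather than a learned one, the codebook comes from plain Lloyd's k-means rather than fine-tuned centroids, and the data, the 4-bit setting, and every function name (`hadamard`, `quantize_uniform`, `kmeans_codebook`, `mse`) are assumptions made for illustration.

```python
import numpy as np

def hadamard(n):
    """Normalized Hadamard matrix (n a power of two); it is orthogonal."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(H.shape[0])

def quantize_uniform(x, bits=4):
    """Symmetric per-tensor uniform quantization, returned dequantized."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def kmeans_codebook(w, k=16, iters=25):
    """Plain Lloyd's k-means on scalar weights (stand-in for fine-tuned centroids)."""
    centroids = np.quantile(w, np.linspace(0.0, 1.0, k))  # quantile initialization
    for _ in range(iters):
        assign = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            members = w[assign == j]
            if members.size:
                centroids[j] = members.mean()
    assign = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids, assign

def mse(a, b):
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)

# Activations with one large outlier (synthetic, assumed data).
x = rng.normal(size=64)
x[3] = 50.0
H = hadamard(64)
err_plain = mse(x, quantize_uniform(x))
# Rotate, quantize, rotate back: the outlier's magnitude is spread over all entries.
err_rot = mse(x, H.T @ quantize_uniform(H @ x))

# Weights with a few outliers (synthetic, assumed data).
w = rng.normal(scale=0.02, size=1024)
w[:4] = [1.0, -1.0, 1.05, -0.95]
centroids, assign = kmeans_codebook(w)      # 16 centroids = 4-bit indices
err_uniform = mse(w, quantize_uniform(w))
err_codebook = mse(w, centroids[assign])    # outliers get their own centroids

print(f"activations: plain={err_plain:.4f} rotated={err_rot:.4f}")
print(f"weights: uniform={err_uniform:.6f} codebook={err_codebook:.6f}")
```

In this toy setting the outlier forces a coarse uniform grid that rounds most ordinary values to zero. The rotation lowers the peak magnitude so the grid becomes finer, and the codebook can dedicate centroids to the extreme weights while keeping the rest for the bulk of the distribution.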