CodeQuant: Unified Clustering and Quantization for Enhanced Outlier Smoothing in Low-Precision Mixture-of-Experts

Xiangyang Yin; Xingyu Liu; Tianhua Xia; Bo Bao; Vithursan Thangarasa; Valavan Manohararajah; Eric Sather; Sai Qian Zhang

arXiv:2604.10496 · cs.LG · April 14, 2026

TL;DR

CodeQuant is a unified clustering-and-quantization method that reduces outlier-induced errors in low-precision MoE models, reporting up to 4.15x speedup and higher accuracy than existing quantization methods.

Contribution

The paper proposes a scheme that combines learnable rotation, which smooths activation outliers, with clustering, which absorbs weight outliers into fine-tuned centroids, to support low-precision deployment of MoE language models.

Findings

01. Achieves up to 4.15x speedup on hardware.

02. Delivers higher accuracy than existing quantization methods.

03. Effectively reduces quantization errors in MoE models.

Abstract

Outliers have emerged as a fundamental bottleneck in preserving accuracy for low-precision large models, particularly within Mixture-of-Experts (MoE) architectures that are increasingly central to large-scale language modeling. Under post-training quantization (PTQ), these outliers induce substantial quantization errors, leading to severe accuracy degradation. While recent rotation-based smoothing techniques alleviate the problem by redistributing outlier magnitudes, residual errors remain and continue to impede reliable low-precision deployment. In this work, we tackle this challenge by introducing CodeQuant, a unified quantization-and-clustering scheme for MoE that smooths activation outliers via learnable rotation and absorbs weight outliers into fine-tuned cluster centroids. This design reduces the influence of extreme values by fitting them within cluster…
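The two ideas named in the abstract, rotation-based smoothing and cluster-centroid (codebook) quantization, can be illustrated with a small NumPy sketch. This is not the CodeQuant implementation: a random orthogonal matrix stands in for the learned rotation, plain 1-D Lloyd (k-means) iterations stand in for the fine-tuned centroids, and all names (`random_orthogonal`, `uniform_quantize`, `lloyd_codebook`) and the toy tensor shapes are assumptions made for illustration.

```python
import numpy as np

def random_orthogonal(dim, rng):
    """Random orthogonal matrix via QR (a stand-in for a learned rotation)."""
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # flipping column signs keeps q orthogonal

def uniform_quantize(values, bits):
    """Symmetric uniform quantizer; returns dequantized values and its grid."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(values).max() / levels
    grid = np.arange(-levels, levels + 1) * scale
    return np.clip(np.round(values / scale), -levels, levels) * scale, grid

def lloyd_codebook(values, centroids, iters=20):
    """1-D k-means (Lloyd) refinement of a codebook; empty clusters stay put."""
    centroids = centroids.copy()
    for _ in range(iters):
        idx = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(centroids.size):
            members = values[idx == j]
            if members.size:
                centroids[j] = members.mean()
    idx = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids, idx

rng = np.random.default_rng(0)
dim = 64
x = rng.standard_normal((32, dim))
x[:, 3] *= 50.0                          # toy activation outlier channel
w = rng.standard_normal((dim, dim)) * 0.02
w[5, 7], w[20, 11] = 1.0, -0.8           # a few toy weight outliers

# Rotation: x @ w == (x @ q) @ (q.T @ w) because q is orthogonal.
q = random_orthogonal(dim, rng)
x_rot, w_rot = x @ q, q.T @ w
peak_before, peak_after = np.abs(x).max(), np.abs(x_rot).max()

# 4-bit uniform baseline versus a codebook initialised on the same grid.
# Starting from the uniform grid means Lloyd updates can only lower the error.
w_uniform, grid = uniform_quantize(w_rot, bits=4)
centroids, idx = lloyd_codebook(w_rot.ravel(), grid)
w_codebook = centroids[idx].reshape(w_rot.shape)

err_uniform = np.mean((w_rot - w_uniform) ** 2)
err_codebook = np.mean((w_rot - w_codebook) ** 2)
print(f"activation peak: {peak_before:.2f} -> {peak_after:.2f}")
print(f"weight MSE: uniform {err_uniform:.3e}, codebook {err_codebook:.3e}")
```

In this toy setup the rotation spreads the outlier channel's energy across all dimensions without changing the layer output, and the codebook places its levels where the weights actually are instead of on an evenly spaced grid stretched by the largest value.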

Code & Models

Repositories

SAI-Lab-NYU/CodeQuant (GitHub)

Videos

CodeQuant: Unified Clustering and Quantization for Enhanced Outlier Smoothing in Low-Precision Mixture-of-Experts (SlidesLive)