MUXQ: Mixed-to-Uniform Precision MatriX Quantization via Low-Rank Outlier Decomposition
Seoungsub Lee, In Seo Kim, and Seon Wook Kim

TL;DR
MUXQ introduces a novel quantization method that detects and redistributes outlier activation channels in LLMs, enabling low-precision INT quantization with minimal accuracy loss for efficient on-device inference.
Contribution
It proposes a low-rank outlier decomposition technique to improve activation quantization in LLMs, addressing hardware inefficiencies caused by outliers.
Findings
MUXQ achieves lower perplexity on GPT-2 models compared to naive quantization.
It enables INT8 quantization of activations and weights with accuracy close to FP16.
MUXQ maintains stable low-precision inference with modest computational overhead.
Abstract
Large language models (LLMs) have achieved outstanding performance across a wide range of natural language processing tasks, but their enormous parameter counts impose substantial memory and computational overheads. This challenge is particularly critical in NPU-based on-device environments, where FP16/FP32 computation is inefficient and integer (INT) quantization is therefore essential. However, existing methods, including ZeroQuant, LLM.int8(), and SmoothQuant, do not fully address input-activation outliers and the associated hardware inefficiencies. To overcome these limitations, we propose MUXQ (Mixed-to-Uniform Quantization). MUXQ detects outlier channels in input activations and introduces a small auxiliary matrix that redistributes outlier magnitudes across channels, thereby alleviating the outlier problem. This enables even activation outliers to be quantized at low-precision INT…
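The abstract does not spell out how the auxiliary matrix is built, so the following is only a rough sketch of the general idea, not MUXQ itself. It swaps in a simple channel-splitting scheme: an outlier activation channel is spread over several equal copies and the matching weight row is repeated, which leaves the matrix product unchanged while shrinking the range a per-tensor INT8 quantizer has to cover. The helper names, the max-magnitude detection criterion, the threshold of 6.0, and the split factor of 8 are all assumptions made for illustration. Plain Python is used to keep the sketch dependency-free.

```python
import random

def absmax_quantize(x, bits=8):
    """Symmetric per-tensor quantization: returns (integer codes, scale)."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(v) for row in x for v in row)
    scale = amax / qmax if amax > 0 else 1.0
    q = [[max(-qmax, min(qmax, round(v / scale))) for v in row] for row in x]
    return q, scale

def dequantize(q, scale):
    return [[v * scale for v in row] for row in q]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(u * v for u, v in zip(row, col)) for col in cols] for row in a]

def find_outlier_channels(x, threshold):
    """Columns whose largest magnitude exceeds `threshold` (assumed criterion)."""
    return [j for j, col in enumerate(zip(*x))
            if max(abs(v) for v in col) > threshold]

def split_outlier_channels(x, w, outliers, parts):
    """Spread each outlier channel over `parts` equal copies and repeat the
    matching weight row, so that x @ w is mathematically unchanged."""
    x_new = [list(row) for row in x]
    w_new = [list(row) for row in w]
    for j in outliers:
        for row in x_new:
            row[j] /= parts
        for _ in range(parts - 1):
            for row in x_new:
                row.append(row[j])      # extra copy of the scaled-down channel
            w_new.append(list(w[j]))    # same weights for every copy
    return x_new, w_new

def mean_abs_error(a, b):
    diffs = [abs(u - v) for ra, rb in zip(a, b) for u, v in zip(ra, rb)]
    return sum(diffs) / len(diffs)

# Toy data: 16 tokens, 8 channels, with channel 3 made an artificial outlier.
random.seed(0)
n, d, m = 16, 8, 4
x = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
for row in x:
    row[3] *= 50.0
w = [[random.gauss(0, 1) for _ in range(m)] for _ in range(d)]

ref = matmul(x, w)

# Naive per-tensor INT8: the outlier channel dictates the scale.
q_naive, scale_naive = absmax_quantize(x)
err_naive = mean_abs_error(ref, matmul(dequantize(q_naive, scale_naive), w))

# Split the detected outlier channels first, then quantize uniformly.
outliers = find_outlier_channels(x, threshold=6.0)
x_split, w_split = split_outlier_channels(x, w, outliers, parts=8)
q_split, scale_split = absmax_quantize(x_split)
err_split = mean_abs_error(ref, matmul(dequantize(q_split, scale_split), w_split))

print("outlier channels:", outliers)
print("quantization step, naive vs split:", scale_naive, scale_split)
print("mean abs output error, naive vs split:", err_naive, err_split)
```

In this toy setup the quantization step after splitting is smaller, so the non-outlier channels keep more resolution. The cost is a wider activation matrix and extra weight rows, which is one way to read the "modest computational overhead" mentioned in the findings. How closely this matches the paper's low-rank construction cannot be determined from the summary alone.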
