OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension
Zhiyuan Zhang, Yanzhao Li, Zhiqiang Zou, Bai Du, Yupeng Sun, Hui Dong, Hui Wang

TL;DR
This paper introduces OSC, a hardware-efficient 4-bit quantization method for large language models that suppresses activation outliers through structured channel separation, improving accuracy and speed.
Contribution
OSC presents a novel outlier suppression framework that combines dual-path computation and structured channel coalescence for efficient low-bit model deployment.
Findings
Incurs accuracy drops of only 2.19 and 1.12 points on Qwen models.
Achieves up to a 1.78x speedup over a W8A8 baseline.
Effectively suppresses activation outliers in 4-bit quantization.
Abstract
While 4-bit quantization is essential for high-throughput deployment of Large Language Models, activation outliers often lead to significant accuracy degradation due to the restricted dynamic range of low-bit formats. In this paper, we systematically investigate the spatial distribution of outliers and demonstrate a token-persistent structural clustering effect, where high-magnitude outliers consistently occupy fixed channels across tokens. Building on this insight, we propose OSC, a hardware-efficient framework for outlier suppression. During inference, OSC executes a dual-path computation consisting of a low-precision 4-bit General Matrix Multiplication (GEMM) path and a high-precision 16-bit branch GEMM path. Specifically, OSC uses an offline group-wise strategy to identify the channels where outliers are located and then performs structured sub-tensor extraction to coalesce these…
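The dual-path idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It simulates 4-bit quantization in floating point ("fake quantization") instead of calling a packed INT4 kernel, and it replaces the paper's offline group-wise selection with a simpler assumed criterion (top-k channels by mean absolute activation). The names `quantize_int4`, `find_outlier_channels`, and `dual_path_matmul` are hypothetical, as are the tensor sizes and the outlier scale of 50.

```python
import numpy as np

def quantize_int4(x, axis):
    """Symmetric 4-bit fake quantization: integers in [-7, 7] plus a scale."""
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(x / scale), -7, 7)
    return q, scale

def find_outlier_channels(calib_acts, k):
    """Offline step (assumed criterion): k channels with the largest mean |x|."""
    score = np.mean(np.abs(calib_acts), axis=0)
    return np.sort(np.argsort(score)[-k:])

def dual_path_matmul(x, w, outlier_idx):
    """x: (tokens, d_in), w: (d_in, d_out). Returns low path + high path."""
    mask = np.zeros(x.shape[1], dtype=bool)
    mask[outlier_idx] = True

    # High-precision branch: outlier channels gathered into a small dense
    # sub-tensor, stored in 16-bit and accumulated in float32.
    x_hi = x[:, mask].astype(np.float16).astype(np.float32)
    w_hi = w[mask, :].astype(np.float16).astype(np.float32)
    y_hi = x_hi @ w_hi

    # Low-precision path: remaining channels, simulated 4-bit GEMM with
    # per-token activation scales and per-output-channel weight scales.
    xq, sx = quantize_int4(x[:, ~mask], axis=1)
    wq, sw = quantize_int4(w[~mask, :], axis=0)
    y_lo = (xq @ wq) * sx * sw

    return y_lo + y_hi

def rel_err(a, b):
    return np.linalg.norm(a - b) / np.linalg.norm(b)

# Synthetic activations with outliers fixed to the same channels for every token.
rng = np.random.default_rng(0)
tokens, d_in, d_out = 64, 128, 32
x = rng.standard_normal((tokens, d_in)).astype(np.float32)
x[:, [5, 40, 99]] *= 50.0
w = rng.standard_normal((d_in, d_out)).astype(np.float32)

ref = x @ w
idx = find_outlier_channels(x, k=3)
err_plain = rel_err(dual_path_matmul(x, w, np.array([], dtype=int)), ref)
err_dual = rel_err(dual_path_matmul(x, w, idx), ref)
print("selected channels:", idx.tolist())
print("relative error, 4-bit only:", err_plain)
print("relative error, dual path: ", err_dual)
```

Because the outliers sit in the same channels for every token, the selection can be done once offline and the high-precision branch stays a small dense GEMM. In the 4-bit-only case the outliers inflate each token's scale, so the ordinary channels lose most of their resolution.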
