FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design
Hao Zhang, Aining Jia, Weifeng Bu, Yushu Cai, Kai Sheng, Hao Chen, Xin He

TL;DR
FlexQ introduces a novel post-training INT6 quantization method for large language models, combining algorithmic strategies with system-level GPU optimizations to achieve near-FP16 accuracy and significant speedups.
Contribution
FlexQ is the first to enable efficient INT6 quantization for LLMs through algorithm-system co-design, including a specialized GPU kernel supporting W6A6 and W6A8 formats.
Findings
Achieves near-FP16 accuracy with minimal perplexity increase.
Provides a 1.39× speedup over existing methods on LLaMA-2-70B.
Reduces memory usage by 1.21× compared to SmoothQuant.
Abstract
Large Language Models (LLMs) demonstrate exceptional performance but entail significant memory and computational costs, restricting their practical deployment. While existing INT4/INT8 quantization methods reduce these costs, they often degrade accuracy or lack optimal efficiency. INT6 quantization offers a superior trade-off between model accuracy and inference efficiency, but lacks hardware support in modern GPUs, forcing emulation via higher-precision arithmetic units that limit acceleration. In this paper, we propose FlexQ, a novel post-training INT6 quantization framework combining algorithmic innovation with system-level optimizations. FlexQ employs uniform 6-bit weight quantization across all layers, with adaptive retention of 8-bit activations in layers identified through layer-wise sensitivity analysis. To maximize hardware efficiency, we develop a specialized high-performance GPU…
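To make the W6A6 / W6A8 split concrete, here is a minimal pure-Python sketch of uniform 6-bit weight quantization with a per-layer choice of activation width. It is an illustration under stated assumptions, not FlexQ's implementation: the abstract does not specify the sensitivity metric, so the relative output error used here, the 0.05 threshold, and the function names (`quantize`, `layer_error`, `choose_activation_bits`) are all hypothetical.

```python
def quantize(values, bits):
    """Symmetric uniform fake-quantization onto a signed `bits`-bit grid."""
    qmax = 2 ** (bits - 1) - 1            # 31 for INT6, 127 for INT8
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    scale = peak / qmax
    # Round to the nearest grid point, clamp, then map back to real values.
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def linear(weight_rows, x):
    """Plain matrix-vector product: one output per weight row."""
    return [sum(w * a for w, a in zip(row, x)) for row in weight_rows]


def layer_error(weight_rows, x, act_bits):
    """Relative L2 output error of a W6A{act_bits} layer vs. full precision.

    Assumption: this error stands in for the paper's (unspecified)
    layer-wise sensitivity measure.
    """
    ref = linear(weight_rows, x)
    w_q = [quantize(row, 6) for row in weight_rows]   # INT6 weights, per row
    out = linear(w_q, quantize(x, act_bits))
    num = sum((r - o) ** 2 for r, o in zip(ref, out))
    den = sum(r ** 2 for r in ref) or 1.0
    return (num / den) ** 0.5


def choose_activation_bits(layers, threshold=0.05):
    """Hypothetical rule: keep 8-bit activations where W6A6 error is too high."""
    return [8 if layer_error(w, x, 6) > threshold else 6 for w, x in layers]


# Toy calibration data: two layers sharing weights, one with an activation outlier.
weights = [[1.0] * 16, [1.0, -1.0] * 8]
benign = [1.0, -0.5] * 8
outlier = [0.3] * 15 + [50.0]
layers = [(weights, benign), (weights, outlier)]

print(choose_activation_bits(layers))
```

In this toy setup the layer whose activations contain a large outlier is the one retained at 8 bits: the outlier stretches the 6-bit grid so far that the small activations round to zero, while the 8-bit grid still resolves them.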
