FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design

Hao Zhang; Aining Jia; Weifeng Bu; Yushu Cai; Kai Sheng; Hao Chen; Xin He

arXiv:2508.04405·cs.LG·November 4, 2025

TL;DR

FlexQ introduces a novel post-training INT6 quantization method for large language models, combining algorithmic strategies with system-level GPU optimizations to achieve near-FP16 accuracy and significant speedups.

Contribution

FlexQ is the first to enable efficient INT6 quantization for LLMs through algorithm-system co-design, including a specialized GPU kernel that supports the W6A6 and W6A8 formats (6-bit weights with 6-bit or 8-bit activations).

Findings

01. Maintains near-FP16 accuracy with only a minimal increase in perplexity.

02. Delivers a 1.39× speedup over existing methods on LLaMA-2-70B.

03. Cuts memory usage by a factor of 1.21 relative to SmoothQuant.

Abstract

Large Language Models (LLMs) demonstrate exceptional performance but entail significant memory and computational costs, restricting their practical deployment. While existing INT4/INT8 quantization methods reduce these costs, they often degrade accuracy or fall short of optimal efficiency. INT6 quantization offers a superior trade-off between model accuracy and inference efficiency, but it lacks hardware support in modern GPUs, forcing emulation via higher-precision arithmetic units that limits acceleration. In this paper, we propose FlexQ, a novel post-training INT6 quantization framework combining algorithmic innovation with system-level optimizations. FlexQ employs uniform 6-bit weight quantization across all layers, with adaptive retention of 8-bit activations in layers identified through layer-wise sensitivity analysis. To maximize hardware efficiency, we develop a specialized high-performance GPU…
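To make the mixed W6A6/W6A8 idea in the abstract concrete, the following is a minimal sketch in plain Python and is not FlexQ's actual algorithm. It assumes symmetric per-tensor uniform quantization and uses mean squared quantization error against a fixed threshold as a stand-in for the paper's layer-wise sensitivity analysis, whose real criterion is not given in the text above. The function names, the `threshold` parameter, and the toy activation values are all hypothetical.

```python
def quantize_symmetric(values, bits):
    """Symmetric uniform quantization of floats to signed `bits`-bit integers.

    Assumption: one scale per tensor; the paper may use a finer granularity.
    """
    qmax = 2 ** (bits - 1) - 1            # 31 for 6 bits, 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to [-qmax, qmax].
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def quant_error(values, bits):
    """Mean squared error between values and their quantize/dequantize round trip."""
    q, scale = quantize_symmetric(values, bits)
    return sum((v - x * scale) ** 2 for v, x in zip(values, q)) / len(values)


def choose_activation_bits(layer_activations, threshold):
    """Keep 8-bit activations only in layers whose 6-bit error exceeds `threshold`.

    Hypothetical sensitivity rule, used here purely for illustration.
    """
    return {
        name: 8 if quant_error(acts, 6) > threshold else 6
        for name, acts in layer_activations.items()
    }


# Toy calibration activations: one well-behaved layer, one with a large outlier.
layers = {
    "smooth": [0.1, -0.2, 0.05, 0.3],
    "outlier": [0.1, -0.2, 0.05, 8.0],
}
plan = choose_activation_bits(layers, threshold=1e-3)
print(plan)
```

In this toy setup the layer with the outlier loses most of its small values to rounding at 6 bits, so it is the one kept at 8-bit activations, while the well-behaved layer stays at 6 bits. Weights would be quantized with `bits=6` in every layer regardless of the plan.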
