QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han

TL;DR
This paper introduces QServe, a GPU inference library co-designed with the QoQ W4A8KV4 quantization algorithm (4-bit weights, 8-bit activations, 4-bit KV cache), which raises throughput and lowers the cost of large-batch LLM serving.
Contribution
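The paper presents QoQ, a W4A8KV4 quantization algorithm (4-bit weights, 8-bit activations, 4-bit KV cache), together with the QServe system design that implements it. The combination targets large-batch, cloud-based LLM serving on GPUs, where existing INT4 methods lose 20-90% of runtime to dequantization overhead.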
Findings
QServe achieves up to 3.5x throughput improvement over TensorRT-LLM.
It reduces the dollar cost of LLM serving by 3x.
QServe on an L40S GPU matches or exceeds the throughput of TensorRT-LLM on an A100.
Abstract
Quantization can accelerate large language model (LLM) inference. Going beyond INT8 quantization, the research community is actively exploring even lower precision, such as INT4. Nonetheless, state-of-the-art INT4 quantization techniques only accelerate low-batch, edge LLM inference, failing to deliver performance gains in large-batch, cloud-based LLM serving. We uncover a critical issue: existing INT4 quantization methods suffer from significant runtime overhead (20-90%) when dequantizing either weights or partial sums on GPUs. To address this challenge, we introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weight, 8-bit activation, and 4-bit KV cache. QoQ stands for quattuor-octo-quattuor, which represents 4-8-4 in Latin. QoQ is implemented by the QServe inference library that achieves measured speedup. The key insight driving QServe is that the efficiency of LLM serving on…
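To make the W4A8KV4 notation concrete, the following is a minimal sketch of what the three bit widths mean numerically. It uses plain per-tensor symmetric quantization, which is an assumption chosen for illustration and is not the QoQ algorithm itself; the helper names (`quantize_symmetric`, `dequantize`) and the sample values are hypothetical.

```python
def quantize_symmetric(values, bits):
    """Per-tensor symmetric quantization to signed `bits`-bit integer codes.

    Illustrative only: a generic scheme, not the QoQ method from the paper.
    """
    qmax = 2 ** (bits - 1) - 1                  # 7 for INT4, 127 for INT8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the signed range [-qmax - 1, qmax].
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


# Hypothetical sample data.
weights = [0.12, -0.70, 0.33, 0.06]
activations = [1.5, -0.25, 0.8, 2.0]
kv_entries = [0.9, -0.4, 0.1]

w_codes, w_scale = quantize_symmetric(weights, bits=4)      # W4
a_codes, a_scale = quantize_symmetric(activations, bits=8)  # A8
kv_codes, kv_scale = quantize_symmetric(kv_entries, bits=4) # KV4

# Dot product accumulated in integers, followed by a single rescale.
int_acc = sum(w * a for w, a in zip(w_codes, a_codes))
approx = int_acc * w_scale * a_scale
exact = sum(w * a for w, a in zip(weights, activations))

# KV cache entries are stored as 4-bit codes and dequantized when read.
kv_restored = dequantize(kv_codes, kv_scale)

print("approx:", approx, "exact:", exact)
print("kv restored:", kv_restored)
```

In this sketch the rescale happens once per dot product. The dequantization overhead described in the abstract arises in real GPU kernels, where converting 4-bit weights or partial sums occurs inside the matrix-multiply loop.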
Videos
MLSys'25 - QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving· youtube
