QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

Yujun Lin; Haotian Tang; Shang Yang; Zhekai Zhang; Guangxuan Xiao; Chuang Gan; Song Han

arXiv:2405.04532·cs.CL·May 2, 2025·6 cites

TL;DR

This paper introduces QServe, a GPU inference library that pairs low-precision quantization (4-bit weights, 8-bit activations, 4-bit KV cache) with a co-designed serving system, raising LLM serving throughput and lowering serving cost.

Contribution

The paper presents QoQ, a W4A8KV4 quantization algorithm (4-bit weights, 8-bit activations, 4-bit KV cache), together with the QServe system design. The combination targets efficient large-batch LLM serving on GPUs, where existing INT4 methods fail to deliver gains.

Findings

01. QServe achieves up to 3.5x higher serving throughput than TensorRT-LLM.

02. QServe reduces the dollar cost of LLM serving by 3x.

03. QServe on an L40S GPU achieves higher throughput than TensorRT-LLM on an A100.

Abstract

Quantization can accelerate large language model (LLM) inference. Going beyond INT8 quantization, the research community is actively exploring even lower precision, such as INT4. Nonetheless, state-of-the-art INT4 quantization techniques only accelerate low-batch, edge LLM inference, failing to deliver performance gains in large-batch, cloud-based LLM serving. We uncover a critical issue: existing INT4 quantization methods suffer from significant runtime overhead (20-90%) when dequantizing either weights or partial sums on GPUs. To address this challenge, we introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weight, 8-bit activation, and 4-bit KV cache. QoQ stands for quattuor-octo-quattuor, which represents 4-8-4 in Latin. QoQ is implemented by the QServe inference library that achieves measured speedup. The key insight driving QServe is that the efficiency of LLM serving on…
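To make the W4A8 notation concrete, here is a minimal sketch of uniform symmetric quantization with 4-bit weights and 8-bit activations, followed by an integer dot product that is rescaled once at the end. It is an illustration only, not the QoQ algorithm or QServe code: it uses a single per-tensor scale, and the function names (`quantize_symmetric`, `dequantize`, `quantized_dot`) are hypothetical.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integer codes in [-qmax, qmax] using one shared scale.

    Assumption: plain per-tensor symmetric quantization, chosen for brevity.
    """
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]


def quantized_dot(weights, activations):
    """Dot product with 4-bit weight codes and 8-bit activation codes.

    Products are accumulated as integers; the two scales are applied once
    to the final sum rather than to every element.
    """
    w_codes, w_scale = quantize_symmetric(weights, bits=4)
    a_codes, a_scale = quantize_symmetric(activations, bits=8)
    acc = sum(w * a for w, a in zip(w_codes, a_codes))
    return acc * w_scale * a_scale


if __name__ == "__main__":
    weights = [0.7, -0.3, 0.1, 0.0]
    activations = [1.0, 2.0, -0.5, 4.0]
    exact = sum(w * a for w, a in zip(weights, activations))
    print("exact:", exact)
    print("W4A8 approximation:", quantized_dot(weights, activations))
```

Applying the scales to the accumulated integer sum, instead of dequantizing each operand first, is the general reason low-precision kernels can be fast. The abstract's point is that when dequantization of weights or partial sums is not handled carefully on GPUs, it can cost 20-90% of the runtime.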

Code & Models

Videos

MLSys'25 - QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving · YouTube

Taxonomy

Topics: Wireless Power Transfer Systems

Methods: Lib