SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving

Jinda Jia; Jisen Li; Zhongzhu Zhou; Jung Hwan Heo; Jue Wang; Tri Dao; Shuaiwen Leon Song; Ben Athiwaratkun; Chenfeng Xu; Tianyi Zhang; Xiaoxia Wu

arXiv:2604.19157 · cs.LG · April 22, 2026

TL;DR

This paper presents a system-aware 4-bit KV-cache quantization method using token-wise INT4 with Hadamard rotation, optimizing accuracy and efficiency for real-world LLM serving under practical constraints.

Contribution

The paper introduces a minimal, practical 4-bit quantization approach with a fused kernel that maintains high accuracy and throughput in real serving environments.

Findings

01. Token-wise INT4 with Hadamard rotation achieves near-lossless accuracy.

02. The method integrates into paged KV-cache layouts with zero overhead.

03. It matches plain INT4 throughput across various concurrency levels.

Abstract

KV-cache memory is a major bottleneck in real-world LLM serving, where systems must simultaneously support latency-sensitive small-batch requests and high-throughput concurrent workloads. Although many KV-cache compression methods improve offline accuracy or compression ratio, they often violate practical serving constraints such as paged memory layouts, regular memory access, and fused attention execution, which limits their effectiveness in deployment. In this work, we identify the minimal set of 4-bit KV-cache quantization methods that remain viable under these constraints. Our central finding is that a simple design (token-wise INT4 quantization with block-diagonal Hadamard rotation) consistently achieves the best accuracy-efficiency trade-off. Across multiple models and benchmarks, this approach recovers nearly all of the accuracy lost by naive INT4, while more complex methods such…
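
To make the design named in the abstract concrete, the following is a minimal pure-Python sketch of token-wise INT4 quantization with a block-diagonal Hadamard rotation. It is an illustration under stated assumptions, not the paper's fused kernel: the function names (`fwht`, `rotate`, `quantize_token`, `roundtrip`), the block size of 32, the symmetric absmax scaling with one scale per token, and the synthetic outlier vector are all hypothetical choices made for this example.

```python
import math
import random


def fwht(block):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two-length list."""
    out = list(block)
    n = len(out)
    assert n > 0 and n & (n - 1) == 0, "block length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)
    return [v * norm for v in out]


def rotate(token, block_size):
    """Block-diagonal Hadamard rotation: each block is transformed independently.

    The normalized Sylvester Hadamard matrix is symmetric and orthogonal,
    so applying this function twice returns the input.
    """
    assert len(token) % block_size == 0
    out = []
    for start in range(0, len(token), block_size):
        out.extend(fwht(token[start:start + block_size]))
    return out


def quantize_token(token):
    """Symmetric INT4 with one scale per token (assumed scheme); codes in [-8, 7]."""
    amax = max(abs(v) for v in token)
    scale = amax / 7.0 if amax > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in token]
    return codes, scale


def dequantize_token(codes, scale):
    return [c * scale for c in codes]


def roundtrip(token, block_size=None):
    """Quantize and dequantize a token, optionally inside the rotated basis."""
    if block_size is None:
        codes, scale = quantize_token(token)
        return dequantize_token(codes, scale)
    codes, scale = quantize_token(rotate(token, block_size))
    return rotate(dequantize_token(codes, scale), block_size)  # undo the rotation


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Synthetic token: unit-variance channels plus one large outlier channel.
random.seed(0)
token = [random.gauss(0.0, 1.0) for _ in range(128)]
token[5] = 50.0

mse_plain = mse(token, roundtrip(token))
mse_rotated = mse(token, roundtrip(token, block_size=32))
print(f"plain INT4 MSE:   {mse_plain:.4f}")
print(f"rotated INT4 MSE: {mse_rotated:.4f}")
```

In this toy setting the single outlier sets the per-token scale, so plain INT4 rounds most of the small channels to zero. The rotation spreads the outlier's energy across its 32-element block, which shrinks the scale and lowers the reconstruction error. A serving implementation would additionally pack two 4-bit codes per byte and store the scale alongside each token in the paged cache; those details are omitted here.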
