XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization

Aditya Tomar; Coleman Hooper; Minjae Lee; Haocheng Xi; Rishabh Tiwari; Wonjun Kang; Luca Manolache; Michael W. Mahoney; Kurt Keutzer; Amir Gholami

arXiv:2508.10395 · cs.LG · August 15, 2025

TL;DR

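XQuant reduces the memory footprint of LLM inference by quantizing and caching the layer input activations X in place of the standard KV cache, then rematerializing the keys and values from X during inference. A cross-layer variant, XQuant-CL, exploits similarities across layers for further compression, achieving near-FP16 accuracy with substantial memory savings.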

Contribution

The paper presents XQuant, which caches low-bit quantized layer input activations and rematerializes the keys and values from them rather than storing a KV cache. With the XQuant-CL variant, this enables up to 12.5× memory savings with minimal accuracy loss.

Findings

01 XQuant achieves up to 7.7× memory savings with <0.1 perplexity degradation.

02 XQuant-CL attains up to 12.5× memory savings with 0.1 perplexity degradation.

03 Both outperform state-of-the-art KV cache quantization methods.

Abstract

Although LLM inference has emerged as a critical workload for many downstream applications, efficiently inferring LLMs is challenging due to the substantial memory footprint and bandwidth requirements. In parallel, compute capabilities have steadily outpaced both memory capacity and bandwidth over the last few decades, a trend that remains evident in modern GPU hardware and exacerbates the challenge of LLM inference. As such, new algorithms are emerging that trade increased computation for reduced memory operations. To that end, we present XQuant, which takes advantage of this trend, enabling an order-of-magnitude reduction in memory consumption through low-bit quantization with substantial accuracy benefits relative to state-of-the-art KV cache quantization methods. We accomplish this by quantizing and caching the layer input activations X, instead of using standard KV caching, and…
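The core idea in the abstract, caching a quantized X and recomputing K and V from it, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the per-token asymmetric uniform quantizer, the 4-bit setting, the toy dimensions, and all names (`quantize_per_token`, `dequantize`, `W_k`, `W_v`) are assumptions chosen for illustration, and the sketch assumes K and V have the same width as X (plain multi-head attention).

```python
import numpy as np

def quantize_per_token(x, bits):
    """Asymmetric uniform quantization with one (scale, offset) pair per row.

    Assumed scheme for illustration; the paper's quantizer may differ.
    """
    levels = 2 ** bits - 1
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels
    # Codes are kept unpacked in uint8 here; a real cache would bit-pack them.
    codes = np.clip(np.round((x - lo) / scale), 0, levels).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Map integer codes back to approximate float values."""
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
seq_len, d_model = 16, 64  # toy sizes, hypothetical
X = rng.standard_normal((seq_len, d_model)).astype(np.float32)
W_k = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)).astype(np.float32)
W_v = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)).astype(np.float32)

# Standard KV caching: compute K and V once and store both tensors.
K_ref = X @ W_k
V_ref = X @ W_v

# X caching: store only the quantized layer input ...
codes, scale, lo = quantize_per_token(X, bits=4)

# ... and rematerialize K and V from it when attention needs them,
# paying two extra matrix multiplications instead of extra memory reads.
X_hat = dequantize(codes, scale, lo)
K_remat = X_hat @ W_k
V_remat = X_hat @ W_v

rel_err_k = np.linalg.norm(K_remat - K_ref) / np.linalg.norm(K_ref)
cached_x_elems = codes.size                   # one tensor cached
cached_kv_elems = K_ref.size + V_ref.size     # two tensors cached
print(f"relative error in K: {rel_err_k:.3f}")
print(f"cached elements, X vs. K+V: {cached_x_elems} vs. {cached_kv_elems}")
```

Under these assumptions, caching X stores one tensor per layer instead of two before any quantization is applied, and the low-bit codes reduce the footprint further. The cost is recomputing the K and V projections at each step, which is the compute-for-memory trade the abstract describes.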
