TaDA: Training-free recipe for Decoding with Adaptive KV Cache Compression and Mean-centering

Vinay Joshi; Pratik Prabhanjan Brahma; Zicheng Liu; Emad Barsoum

arXiv:2506.04642 · cs.CL · June 6, 2025

TL;DR

TaDA is a training-free KV cache compression method that quantizes key and value activations with layer-adaptive precision and mean-centers them, reducing KV cache memory to 27% of baseline while maintaining accuracy, which enables scalable inference for large language models.

Contribution

Introduces TaDA, a training-free quantization approach that compresses KV caches with precision adapted to each layer's error sensitivity and, through mean centering, removes the need to manage outliers separately, improving scalability and efficiency.

Findings

01. Reduces KV cache memory to 27% of baseline
02. Maintains accuracy across various models and context lengths
03. Eliminates need for separate outlier management in quantization
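
To put the 27% figure in context, the sketch below estimates the size of an uncompressed KV cache with the standard size formula and scales it by 0.27. The model configuration and the 16-bit baseline are illustrative assumptions, not values taken from the paper, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Size of an uncompressed KV cache in bytes.

    Keys and values (the factor 2) each store one head_dim vector
    per token, per KV head, per layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical configuration (assumption): 32 layers, 32 KV heads,
# head dimension 128, 4096-token context, 16-bit (2-byte) elements.
baseline = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)

# The reported reduction: compressed cache is 27% of the baseline.
compressed = 0.27 * baseline

print(f"baseline:   {baseline / 2**30:.2f} GiB")
print(f"compressed: {compressed / 2**30:.2f} GiB")
```

If the baseline stores 16-bit elements, 27% of baseline corresponds to an average of about 0.27 × 16 = 4.32 bits per cached element, including any metadata such as scales and means.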

Abstract

The key-value (KV) cache in transformer models is a critical component for efficient decoding or inference, yet its memory demands scale poorly with sequence length, posing a major challenge for scalable deployment of large language models. Among several approaches to KV cache compression, quantization of key and value activations has been widely explored. Most KV cache quantization methods still need to manage sparse and noncontiguous outliers separately. To address this, we introduce TaDA, a training-free recipe for KV cache compression with quantization precision that adapts to error sensitivity across layers and mean centering that eliminates separate outlier handling. Our approach yields substantial accuracy improvements for multiple models supporting various context lengths. Moreover, our approach does not need to separately manage outlier elements -- a persistent hurdle in most…
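
The intuition behind mean centering can be illustrated with a toy example. The sketch below is not the paper's algorithm: the symmetric uniform quantizer, the choice to center one channel's values across tokens, and all function names are assumptions made for illustration. It shows that when values share a large offset, subtracting the mean before low-bit quantization shrinks the range the quantizer must cover and lowers the reconstruction error. The `bits` argument is the knob that a layer-adaptive scheme would set differently per layer.

```python
def quantize_symmetric(values, bits):
    """Uniform symmetric quantization: floats -> signed integer codes plus a scale."""
    qmax = (1 << (bits - 1)) - 1           # e.g. 7 for 4 bits
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def quantize_mean_centered(values, bits):
    """Subtract the mean, then quantize the residuals; the mean is stored separately."""
    mean = sum(values) / len(values)
    codes, scale = quantize_symmetric([v - mean for v in values], bits)
    return codes, scale, mean


def dequantize(codes, scale, mean=0.0):
    """Map integer codes back to floats and add the stored mean back."""
    return [c * scale + mean for c in codes]


def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Toy channel (assumption): values clustered around a large offset of 10.
channel = [10.0 + 0.1 * i for i in range(-8, 9)]
bits = 4

codes, scale = quantize_symmetric(channel, bits)
plain_err = max_abs_error(channel, dequantize(codes, scale))

codes_c, scale_c, mean = quantize_mean_centered(channel, bits)
centered_err = max_abs_error(channel, dequantize(codes_c, scale_c, mean))

print(f"max error without centering: {plain_err:.4f}")
print(f"max error with centering:    {centered_err:.4f}")
```

In this toy case the uncentered quantizer spends almost all of its levels on the offset, while the centered one spends them on the variation around the mean, at the cost of storing one extra value per group.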

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Advanced Neural Network Applications