TaDA: Training-free recipe for Decoding with Adaptive KV Cache Compression and Mean-centering
Vinay Joshi, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum

TL;DR
TaDA is a training-free KV cache compression method that adaptively quantizes and mean-centers activations, significantly reducing memory usage while maintaining accuracy, enabling scalable inference for large language models.
Contribution
Introduces TaDA, a training-free KV cache quantization approach that adapts precision to each layer's error sensitivity and uses mean centering so that outliers need no separate handling, improving scalability and efficiency.
Findings
Reduces KV cache memory to 27% of baseline
Maintains accuracy across various models and context lengths
Eliminates need for separate outlier management in quantization
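To put the 27% figure in perspective, the arithmetic below is a rough sketch rather than a result from the paper. It assumes a 16-bit baseline cache and a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, 131072-token context); none of these values come from the source. Under that assumption, 27% of baseline corresponds to an effective budget of about 4.32 bits per cached element, including any quantization metadata.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits_per_element, batch=1):
    """Size of a KV cache in bytes; the factor 2 covers keys and values."""
    elements = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch
    return elements * bits_per_element / 8

# Hypothetical model shape and a 16-bit baseline (assumptions, not from the paper).
BASELINE_BITS = 16
baseline = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                          seq_len=131072, bits_per_element=BASELINE_BITS)

# Reported finding: compressed cache is 27% of the baseline.
compressed = 0.27 * baseline
effective_bits = 0.27 * BASELINE_BITS  # average bits per element, metadata included

print(f"baseline:   {baseline / 2**30:.2f} GiB")
print(f"compressed: {compressed / 2**30:.2f} GiB")
print(f"effective bits per element: {effective_bits:.2f}")
```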
Abstract
The key-value (KV) cache in transformer models is a critical component for efficient decoding or inference, yet its memory demands scale poorly with sequence length, posing a major challenge for scalable deployment of large language models. Among several approaches to KV cache compression, quantization of key and value activations has been widely explored. Most KV cache quantization methods still need to manage sparse and noncontiguous outliers separately. To address this, we introduce TaDA, a training-free recipe for KV cache compression that adapts quantization precision to error sensitivity across layers and applies mean centering to eliminate separate outlier handling. Our approach yields substantial accuracy improvements for multiple models supporting various context lengths. Moreover, our approach does not need to separately manage outlier elements -- a persistent hurdle in most…
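The two ideas named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the choice of per-channel means, per-token symmetric scaling, the bit widths, and the `assign_bits` threshold rule are all assumptions made for illustration. The example shows why centering helps when one channel carries a large constant offset, a common outlier pattern in key activations.

```python
import numpy as np

def quantize(x, n_bits, center=True):
    """Quantize x of shape (tokens, channels) to signed n_bits integers.

    Assumed scheme: subtract the per-channel mean (if center=True), then
    apply symmetric per-token scaling to the residual. Requires n_bits >= 2.
    """
    if center:
        mean = x.mean(axis=0, keepdims=True)
    else:
        mean = np.zeros((1, x.shape[1]), dtype=x.dtype)
    resid = x - mean
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.maximum(np.abs(resid).max(axis=1, keepdims=True), 1e-8) / qmax
    codes = np.clip(np.round(resid / scale), -qmax, qmax).astype(np.int8)
    return codes, scale, mean

def dequantize(codes, scale, mean):
    """Reconstruct an approximation of the original activations."""
    return codes.astype(np.float32) * scale + mean

def assign_bits(layer_errors, low_bits=2, high_bits=4, threshold=0.05):
    """Hypothetical rule: give more bits to layers with higher measured error."""
    return [high_bits if e > threshold else low_bits for e in layer_errors]

def reconstruction_mse(x, n_bits, center):
    codes, scale, mean = quantize(x, n_bits, center=center)
    return float(np.mean((dequantize(codes, scale, mean) - x) ** 2))

# Synthetic keys: unit-variance noise plus one channel with a large offset.
rng = np.random.default_rng(0)
keys = rng.standard_normal((64, 32)).astype(np.float32)
keys[:, 3] += 50.0

err_plain = reconstruction_mse(keys, n_bits=4, center=False)
err_centered = reconstruction_mse(keys, n_bits=4, center=True)
print(f"MSE without centering: {err_plain:.4f}")
print(f"MSE with centering:    {err_centered:.4f}")
print("bits per layer:", assign_bits([0.01, 0.2, 0.03]))
```

In this toy setup the offset channel dominates each token's scale when the data is not centered, so most other values collapse to zero. Removing the per-channel mean first lets the same 4-bit grid cover the residual without treating the offset channel as a separate outlier.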
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Advanced Neural Network Applications
