One Size Does Not Fit All: Token-Wise Adaptive Compression for KV Cache

Liming Lu; Kaixi Qiu; Jiayu Zhou; Jushi Kai; Haoyan Zhang; Huanyu Wang; Jingwen Leng; Ziwei He; Zhouhan Lin

arXiv:2603.04411 · cs.CL · March 6, 2026

Open Access

TL;DR

DynaKV is a post-training, token-wise adaptive compression method for KV caches in large language models, significantly reducing memory usage while preserving performance by dynamically allocating compression rates based on token semantics.

Contribution

The paper introduces DynaKV, which the authors describe as the first dynamic, token-wise adaptive compression framework for KV caches. It improves fidelity at high compression ratios without retraining from scratch.

Findings

01. Retains only 6% of KV cache while maintaining 94% performance.

02. Outperforms existing compression techniques in memory reduction and quality.

03. Compatible with sequence-level pruning methods.

Abstract

Despite the remarkable progress of Large Language Models (LLMs), the escalating memory footprint of the Key-Value (KV) cache remains a critical bottleneck for efficient inference. While dimensionality reduction offers a promising compression avenue, existing approaches typically either necessitate prohibitively expensive pre-training from scratch or suffer from severe performance deterioration under high compression regimes. In this work, we propose DynaKV, a novel post-training framework for low-rank KV cache compression. To the best of our knowledge, DynaKV is the first method to dynamically allocate compression rates to individual tokens according to their semantic meaning, which allows it to achieve better fidelity at aggressive compression ratios. Extensive experiments demonstrate that our method consistently outperforms existing state-of-the-art compression techniques, achieving…
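
The idea of giving each token its own compression rate in a low-rank setting can be illustrated with a small NumPy sketch. This is not the DynaKV algorithm: the excerpt above does not say how rates are allocated beyond "semantic meaning", so the sketch swaps in a simple energy-threshold rule as a stand-in. The function names (`fit_basis`, `compress_tokenwise`, `decompress`) and the `energy` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np

def fit_basis(keys):
    """Orthonormal basis from the SVD of a (T, d) key matrix, T >= d assumed."""
    _, _, vt = np.linalg.svd(keys, full_matrices=False)
    return vt.T  # columns are directions ordered by singular value

def compress_tokenwise(keys, basis, energy=0.9):
    """Keep, per token, the fewest leading coefficients holding `energy`
    of that token's squared norm. ASSUMPTION: this allocation rule is a
    stand-in, not the rule used in the paper."""
    coeffs = keys @ basis                       # (T, d) coordinates in the basis
    cum = np.cumsum(coeffs ** 2, axis=1)        # cumulative energy per token
    total = cum[:, -1:]
    # Number of prefixes still below the threshold, plus one, is the rank.
    ranks = (cum < energy * total).sum(axis=1) + 1
    ranks = np.minimum(ranks, coeffs.shape[1])
    compressed = [c[:r].copy() for c, r in zip(coeffs, ranks)]
    return compressed, ranks

def decompress(compressed, basis):
    """Rebuild approximate keys from variable-length coefficient vectors."""
    out = np.zeros((len(compressed), basis.shape[0]))
    for i, c in enumerate(compressed):
        out[i] = basis[:, :len(c)] @ c
    return out

# Toy data: 64 tokens, head dimension 16, approximately rank 4.
rng = np.random.default_rng(0)
keys = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 16))
keys += 0.01 * rng.normal(size=(64, 16))

basis = fit_basis(keys)
compressed, ranks = compress_tokenwise(keys, basis, energy=0.9)
approx = decompress(compressed, basis)
kept_fraction = ranks.sum() / keys.size
```

Because the basis is orthonormal, each token's squared reconstruction error is at most `1 - energy` of its squared norm, while tokens whose energy is concentrated in few directions are stored with fewer coefficients. A fixed-rank scheme would instead spend the same budget on every token.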

Taxonomy

Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Natural Language Processing Techniques