EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction
Shiyu Ji, Yixuan Wang, Yijun Liu, Qingfu Zhu, Wanxiang Che

TL;DR
EchoKV is a flexible, similarity-based KV cache compression method for LLMs. It reduces memory usage while matching full-cache throughput in short-context scenarios, and it supports on-demand switching between full and compressed caches.
Contribution
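EchoKV proposes a compression framework in which a lightweight network reconstructs discarded KV components from a retained subset, paired with a two-stage fine-tuning strategy that takes only a few minutes on a single A100 GPU for a 7B model. Unlike low-rank methods that modify model projections, it can switch between full and compressed caching on demand.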
Findings
Outperforms existing KV cache compression methods across multiple compression ratios and models.
Matches the throughput of full-cache inference in short-context scenarios.
Requires only a few minutes of fine-tuning on a single A100 GPU for a 7B model.
Abstract
The increasing memory demand of the Key-Value (KV) cache poses a significant bottleneck for Large Language Models (LLMs) in long-context applications. Existing low-rank KV compression methods reduce this footprint by modifying model projections, limiting the flexibility to switch back to standard full-cache inference when sufficient memory is available. In this paper, we propose EchoKV, a flexible KV cache compression framework that supports on-demand transitions from full KV caching to compressed caching. Unlike traditional compression-decompression paradigms, EchoKV utilizes a lightweight network to reconstruct the discarded KV components from a partial subset, exploiting intrinsic inter-layer and intra-layer similarities among attention heads. We further introduce a lightweight two-stage fine-tuning strategy, requiring only a few minutes on a single A100 GPU for a 7B model.…
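To make the memory bottleneck described in the abstract concrete, the following is a minimal sketch of the standard KV cache size estimate, together with the footprint when only a fraction of the KV components is stored. The model configuration (32 layers, 32 KV heads, head dimension 128, fp16) is an assumed 7B-class layout, and `keep_ratio=0.5` is a hypothetical setting, not a ratio reported by the paper. The estimate also ignores the parameters of the reconstruction network, whose size is not stated here.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Bytes needed to cache keys and values for every layer and token."""
    # The leading factor 2 accounts for the separate key and value tensors.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


def compressed_cache_bytes(full_bytes, keep_ratio):
    """Bytes stored when only a `keep_ratio` fraction of KV components is kept.

    Illustrative only: the discarded components are assumed to be
    reconstructed at attention time, and the reconstruction network's
    own parameters are not counted.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    return int(full_bytes * keep_ratio)


if __name__ == "__main__":
    # Assumed 7B-class configuration with a 32K-token context in fp16.
    full = kv_cache_bytes(num_layers=32, num_kv_heads=32,
                          head_dim=128, seq_len=32768)
    half = compressed_cache_bytes(full, keep_ratio=0.5)  # hypothetical ratio
    print(f"full cache:       {full / 2**30:.1f} GiB")
    print(f"compressed cache: {half / 2**30:.1f} GiB")
```

Under these assumptions each token costs 512 KiB of cache, so the full cache grows linearly with context length, which is why retaining only a subset of KV components matters most in long-context settings.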
