EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction

Shiyu Ji; Yixuan Wang; Yijun Liu; Qingfu Zhu; Wanxiang Che

arXiv:2603.22910·cs.CL·May 14, 2026


TL;DR

EchoKV is a flexible, similarity-based KV cache compression method for LLMs that reduces memory usage while matching full-cache throughput in short-context scenarios, and it supports on-demand switching from full to compressed caching.

Contribution

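EchoKV proposes a compression framework in which a lightweight network reconstructs discarded KV components from a retained subset, paired with a two-stage fine-tuning strategy that takes only a few minutes on a single A100 GPU for a 7B model. Because the original full-cache path remains available, the model can fall back to standard full-cache inference when memory allows.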

Findings

01 Outperforms existing compression methods across multiple compression ratios and models.

02 Maintains the throughput of full-cache inference in short-context scenarios.

03 Requires only a few minutes of fine-tuning on a single A100 GPU for a 7B model.

Abstract

The increasing memory demand of the Key-Value (KV) cache poses a significant bottleneck for Large Language Models (LLMs) in long-context applications. Existing low-rank KV compression methods reduce this footprint by modifying model projections, limiting the flexibility to switch back to standard full-cache inference when sufficient memory is available. In this paper, we propose EchoKV, a flexible KV cache compression framework that supports on-demand transitions from full KV caching to compressed caching. Unlike traditional compression-decompression paradigms, EchoKV utilizes a lightweight network to reconstruct the discarded KV components from a partial subset, exploiting intrinsic inter-layer and intra-layer similarities among attention heads. We further introduce a lightweight two-stage fine-tuning strategy, requiring only a few minutes on a single A100 GPU for a 7B model.…
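
To make the reconstruction idea concrete, here is a minimal toy sketch, not the paper's implementation. It assumes that the KV entries of some attention heads can be predicted from those of other heads, keeps only a subset of heads in the cache, and rebuilds the rest on demand. The paper uses a lightweight network with two-stage fine-tuning; this sketch swaps in a plain least-squares linear map for simplicity. All names (`split_heads`, `fit_reconstructor`, `reconstruct`, `toy_kv`) are hypothetical, and the synthetic data is constructed so that the dropped heads are an exact linear function of the kept ones.

```python
import numpy as np

def split_heads(kv, kept_heads):
    """Split a (tokens, heads, head_dim) cache into kept/dropped parts, flattened per token."""
    tokens, num_heads, _ = kv.shape
    dropped_heads = [h for h in range(num_heads) if h not in kept_heads]
    kept = kv[:, kept_heads, :].reshape(tokens, -1)
    dropped = kv[:, dropped_heads, :].reshape(tokens, -1)
    return kept, dropped, dropped_heads

def fit_reconstructor(kept, dropped):
    """Assumption: a linear map stands in for the paper's lightweight network."""
    W, *_ = np.linalg.lstsq(kept, dropped, rcond=None)
    return W

def reconstruct(kept, W, kept_heads, dropped_heads, head_dim):
    """Rebuild the full cache from the kept heads plus the predicted dropped heads."""
    tokens = kept.shape[0]
    num_heads = len(kept_heads) + len(dropped_heads)
    full = np.empty((tokens, num_heads, head_dim), dtype=kept.dtype)
    full[:, kept_heads, :] = kept.reshape(tokens, len(kept_heads), head_dim)
    full[:, dropped_heads, :] = (kept @ W).reshape(tokens, len(dropped_heads), head_dim)
    return full

# Toy setup (hypothetical sizes): 4 heads of dimension 8, keep heads 0 and 2.
rng = np.random.default_rng(0)
num_heads, head_dim = 4, 8
kept_heads = [0, 2]
mix = rng.normal(size=(len(kept_heads) * head_dim, 2 * head_dim))

def toy_kv(tokens):
    """Synthetic cache whose heads 1 and 3 are an exact linear function of heads 0 and 2."""
    kept = rng.normal(size=(tokens, len(kept_heads) * head_dim))
    kv = np.empty((tokens, num_heads, head_dim))
    kv[:, [0, 2], :] = kept.reshape(tokens, 2, head_dim)
    kv[:, [1, 3], :] = (kept @ mix).reshape(tokens, 2, head_dim)
    return kv

# Fit the map on calibration tokens, then reconstruct unseen tokens from half the cache.
calib_kept, calib_dropped, dropped_heads = split_heads(toy_kv(256), kept_heads)
W = fit_reconstructor(calib_kept, calib_dropped)

test_kv = toy_kv(32)
kept_t, _, _ = split_heads(test_kv, kept_heads)   # this is all that would be stored
full = reconstruct(kept_t, W, kept_heads, dropped_heads, head_dim)
print("stored fraction:", kept_t.size / test_kv.size)
print("max reconstruction error:", np.abs(full - test_kv).max())
```

Because the model's own projections are untouched in this sketch, the uncompressed cache can still be used directly whenever memory permits, which mirrors the flexibility the abstract describes. On real KV caches the relationship between heads is only approximate, so reconstruction would be lossy and the choice of retained heads would matter.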
