CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios

Luning Wang; Shiyao Li; Xuefei Ning; Zhihang Yuan; Shengen Yan; Guohao Dai; Yu Wang

arXiv:2409.10593 · cs.LG · October 22, 2024

TL;DR

CSKV introduces a low-rank, channel-shrinking method for KV cache compression in LLMs, significantly reducing memory overhead with minimal performance loss and low training costs.

Contribution

The paper proposes a training-efficient channel-shrinking technique that uses low-rank decomposition to compress the KV cache in long-context LLMs. It reduces KV cache memory by 80% on its own and by up to 95% when combined with quantization.

Findings

01. Reduces KV cache memory by 80% while maintaining performance

02. Achieves up to 95% compression when combined with quantization

03. Maintains long-context capabilities with minimal retraining effort
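The two headline figures can be related with some simple arithmetic. The sketch below is only an illustration: the model shape (32 layers, 32 KV heads, head dimension 128, 32K tokens), the 16-bit baseline, and the 4-bit quantization are assumptions that do not come from the paper, and the helper `kv_cache_bytes` is hypothetical. It also ignores quantization metadata such as scales and zero points.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2.0, channel_keep_ratio=1.0):
    """Rough KV cache size in bytes (hypothetical helper, not from the paper)."""
    # Channels stored per token and per layer, after optional channel shrinking.
    channels = num_kv_heads * head_dim * channel_keep_ratio
    # The leading factor of 2 accounts for storing both keys and values.
    return 2 * num_layers * seq_len * channels * bytes_per_value


# Assumed model shape and context length (illustrative only).
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=32768)

# 16-bit baseline with all channels kept.
baseline = kv_cache_bytes(**shape)

# Keep 20% of the channels, i.e. an 80% reduction from shrinking alone.
shrunk = kv_cache_bytes(**shape, channel_keep_ratio=0.2)

# Additionally store each value in 4 bits (0.5 bytes), an assumed setting.
shrunk_q4 = kv_cache_bytes(**shape, bytes_per_value=0.5, channel_keep_ratio=0.2)

reduction_shrunk = 1 - shrunk / baseline
reduction_q4 = 1 - shrunk_q4 / baseline

print(f"baseline: {baseline / 2**30:.1f} GiB")
print(f"channel shrinking only: {reduction_shrunk:.0%} smaller")
print(f"shrinking + 4-bit: {reduction_q4:.0%} smaller")
```

Under these assumptions the two savings multiply: keeping 20% of the channels at a quarter of the bit width leaves 0.2 × 0.25 = 5% of the original cache.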

Abstract

Large Language Models (LLMs) have been widely adopted to process long-context tasks. However, the large memory overhead of the key-value (KV) cache poses significant challenges in long-context scenarios. Existing training-free KV cache compression methods typically focus on quantization and token pruning, which have compression limits, and excessive sparsity can lead to severe performance degradation. Other methods design new architectures with less KV overhead but require significant training overhead. To address the above two drawbacks, we further explore the redundancy in the channel dimension and apply an architecture-level design with minor training costs. Therefore, we introduce CSKV, a training-efficient Channel Shrinking technique for KV cache compression: (1) We first analyze the singular value distribution of the KV cache, revealing significant redundancy and compression…
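The abstract's first step, inspecting the singular value distribution of the KV cache, and the idea of a low-rank factorization of the projection can be sketched with NumPy. This is a minimal illustration on synthetic data, not the authors' implementation: the matrix sizes, the synthetic low-rank key projection `W_k`, and the use of a plain truncated SVD are all assumptions, and the sketch omits whatever training CSKV applies to the factors.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden, kv_dim, true_rank = 256, 64, 64, 8

# Synthetic key projection with low effective rank plus small noise (assumption).
W_k = rng.standard_normal((hidden, true_rank)) @ rng.standard_normal((true_rank, kv_dim))
W_k += 0.01 * rng.standard_normal((hidden, kv_dim))

X = rng.standard_normal((seq_len, hidden))  # token hidden states
K = X @ W_k                                 # full key cache, shape (seq_len, kv_dim)

# Step 1: singular value distribution of the cached keys.
s = np.linalg.svd(K, compute_uv=False)
energy = np.cumsum(s**2) / np.sum(s**2)
# Smallest number of singular directions holding at least 99% of the energy.
rank_99 = int(np.searchsorted(energy, 0.99) + 1)

# Step 2: factor W_k ~= A @ B with r channels via truncated SVD.
r = true_rank
U, S, Vt = np.linalg.svd(W_k, full_matrices=False)
A = U[:, :r] * S[:r]   # (hidden, r): down-projection, produces the shrunk cache
B = Vt[:r, :]          # (r, kv_dim): up-projection applied when keys are needed

K_compressed = X @ A           # only this (seq_len, r) tensor would be cached
K_restored = K_compressed @ B  # approximate keys recovered on the fly
rel_err = np.linalg.norm(K - K_restored) / np.linalg.norm(K)

print(f"channels holding 99% of the energy: {rank_99} of {kv_dim}")
print(f"cached channels: {r} of {kv_dim}, relative error: {rel_err:.4f}")
```

In this toy setting the cache shrinks from `kv_dim` to `r` channels per token because only `K_compressed` is stored. Real KV caches are not exactly low rank, which is presumably why the method pairs the decomposition with a small amount of training.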

Code & Models

Repositories

wln20/CSKV (PyTorch, official)

Taxonomy

Topics: Advanced Wireless Network Optimization · Caching and Content Delivery · IPv6, Mobility, Handover, Networks, Security