Graph-Guided Adaptive Channel Elimination for KV Cache Compression
Enwei Tong, Yao Zhu, Yuanchao Bai, Kai Wang, Xianming Liu, Xiangyang Ji

TL;DR
GRACE is a graph-based framework for KV cache compression in large language models, achieving 60% size reduction with minimal performance loss by modeling channel interactions and protecting salient channels.
Contribution
GRACE introduces a graph-guided approach to adaptive channel elimination for KV cache compression that accounts for both inter-channel interactions and channel saliency.
Findings
Reduces KV cache size by 60% with negligible performance loss.
Outperforms state-of-the-art methods in cache compression.
Models channel interactions as a graph for optimized pruning.
Abstract
Large Language Models have revolutionized natural language processing, achieving unprecedented success across a vast range of tasks. However, their practical application in long-context scenarios is severely hampered by the formidable memory footprint of the Key-Value cache. While channel pruning has emerged as a promising compression strategy, existing methods evaluate channel importance in isolation, fundamentally ignoring the inter-channel interactions that collectively dictate model performance. This oversight leads to suboptimal pruning decisions. To address this, we introduce GRACE (GRaph-guided Adaptive Channel Elimination), a novel framework that reframes KV cache compression as a graph-based optimization problem. GRACE models channels as nodes and their interactions as weighted edges, enabling the identification of a near-optimal…
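To make the "channels as nodes, interactions as weighted edges" idea concrete, here is a minimal NumPy sketch. It is not the GRACE algorithm, whose scoring and optimization are not described in the text available here. Every specific choice is an assumption made for illustration: node saliency as mean absolute activation, edge weight as absolute Pearson correlation, a greedy selection rule, and the hypothetical names `build_channel_graph`, `select_channels`, and `redundancy_penalty`.

```python
import numpy as np


def build_channel_graph(keys):
    """Build an illustrative channel graph from cached key activations.

    keys: array of shape (num_tokens, num_channels).
    Returns (saliency, weights): one score per node and a symmetric
    matrix of edge weights with a zero diagonal.
    """
    # Assumption: node saliency is the mean absolute activation per channel.
    saliency = np.abs(keys).mean(axis=0)
    # Assumption: edge weight is the absolute Pearson correlation between
    # two channels. Constant channels yield NaN, which is mapped to 0.
    weights = np.nan_to_num(np.abs(np.corrcoef(keys, rowvar=False)))
    np.fill_diagonal(weights, 0.0)
    return saliency, weights


def select_channels(saliency, weights, keep, redundancy_penalty=1.0):
    """Greedily pick `keep` channels, trading saliency against redundancy.

    A candidate is penalized by its strongest edge to any channel that
    has already been kept, so near-duplicate channels are not both kept.
    """
    scaled = saliency / (saliency.max() + 1e-12)  # scores in [0, 1]
    overlap = np.zeros_like(scaled)               # max edge weight to kept set
    available = np.ones(scaled.shape[0], dtype=bool)
    kept = []
    for _ in range(min(keep, scaled.shape[0])):
        score = np.where(available, scaled - redundancy_penalty * overlap, -np.inf)
        best = int(np.argmax(score))
        kept.append(best)
        available[best] = False
        overlap = np.maximum(overlap, weights[best])
    return sorted(kept)


# Toy cache: channel 1 is a rescaled copy of channel 0, channel 2 is independent.
rng = np.random.default_rng(0)
a = rng.standard_normal(256)
b = rng.standard_normal(256)
keys = np.stack([2.0 * a, 1.9 * a, 0.5 * b], axis=1)

saliency, weights = build_channel_graph(keys)
kept = select_channels(saliency, weights, keep=2)
compressed_keys = keys[:, kept]  # pruned cache keeps only the selected channels
```

In this toy setup a per-channel ranking by saliency alone would keep channels 0 and 1, which carry the same information. The edge-aware rule instead drops the redundant copy and keeps the weaker but independent channel 2, which is the kind of decision that scoring channels in isolation cannot make.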
