SurfaceLogicKV: Surface and Logic Attention Behaviors are All You Need for Robust KV Cache Compression
Mengjie Li, William J. Song

TL;DR
SurfaceLogicKV leverages attention behavior analysis to improve key-value cache compression in large language models, enhancing robustness and efficiency for long-sequence inference.
Contribution
The paper introduces a two-stage method that distinguishes surface memorization from logic construction in attention heads and uses that distinction to guide KV cache compression.
Findings
Achieves improved robustness in KV cache compression.
Maintains competitive performance across various tasks.
Outperforms baselines and sometimes even FullKV.
Abstract
The increasing input sequence length in Large Language Models (LLMs) puts significant pressure on key-value (KV) cache storage, making efficient inference challenging. Explicitly dividing attention behavior into our self-defined categories of surface memorization and logic construction reveals their essential roles in long-context reasoning. We observe that an individual attention head can display various behaviors, with nearly 98.5% effectively ignoring completely irrelevant information. The remaining 1.5% behaves as logic construction, and 0.5% behaves as surface memorization. Based on layer- and head-wise integration, we propose a novel two-stage SurfaceLogicKV method that uses these attention behaviors for KV cache compression. As a result, it achieves improved compression robustness while maintaining competitive performance across various tasks and long sequences compared to baselines or even…
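To make the idea of head-wise KV cache compression concrete, here is a minimal sketch of generic budgeted token selection. It is not the SurfaceLogicKV algorithm, whose scoring rules are not given in the abstract. The function names `allocate_budgets` and `select_tokens`, the per-head importance scores, and the proportional allocation rule are all assumptions made for illustration; scores are assumed to be non-negative.

```python
def allocate_budgets(head_scores, total_budget, min_per_head=1):
    """Split total_budget cache slots across heads in proportion to head_scores.

    head_scores is a hypothetical per-head importance value (assumed >= 0);
    how the paper actually scores heads is not stated in the abstract.
    """
    n = len(head_scores)
    spare = total_budget - min_per_head * n
    if spare < 0:
        raise ValueError("total_budget is too small for min_per_head")
    total = sum(head_scores)
    if total <= 0:
        shares = [spare / n] * n  # no signal: split evenly
    else:
        shares = [spare * s / total for s in head_scores]
    budgets = [min_per_head + int(x) for x in shares]
    # Hand out the slots lost to rounding down, largest fractional part first.
    leftover = total_budget - sum(budgets)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


def select_tokens(attn_per_head, budgets):
    """For each head, keep the indices of its highest-scoring cached tokens.

    attn_per_head[h][t] is an assumed attention score of head h on cached
    token t; the kept indices are returned in original sequence order.
    """
    kept = []
    for scores, k in zip(attn_per_head, budgets):
        k = min(k, len(scores))
        top = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)[:k]
        kept.append(sorted(top))
    return kept
```

For example, `allocate_budgets([3.0, 1.0], 6)` gives the first head four slots and the second head two, and `select_tokens` then keeps that many top-scoring token positions per head. Giving more cache to heads judged more important is one common design choice in head-wise compression; the paper's two-stage procedure may differ.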
