On the Limits of Learned Importance Scoring for KV Cache Compression
Brady Steele

TL;DR
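This paper evaluates learned importance scoring for KV cache compression in language models and finds that simple heuristics match or outperform the learned scorer. It attributes this to the limited information available in KV representations and hypothesizes that circular dependence between future queries and generation trajectories contributes to the difficulty.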
Contribution
It introduces Speculative Importance Prediction (SIP), a learned scorer that predicts token importance from KV representations alone, and shows that it does not outperform simple heuristics, prefill attention, or random selection across three tasks.
Findings
Position-based heuristics match or outperform learned scorers.
Prefill attention provides a signal equivalent to that of complex learned scorers.
Limited information in KV representations constrains importance prediction.
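To make the prefill-attention baseline concrete, here is a minimal sketch of one way such a signal could be turned into a keep-set. The summary does not say how the paper aggregates attention, so everything here is an assumption for illustration: a single head and layer, a score defined as the total attention each key position receives during prefill, and the hypothetical function names `prefill_attention_scores` and `top_k_indices`.

```python
def prefill_attention_scores(attn: list[list[float]]) -> list[float]:
    """Score each key position by the total attention it receives.

    attn[q][k] is the attention weight from query position q to key
    position k during prefill (assumed: one head, one layer, causal mask).
    """
    n = len(attn)
    return [sum(attn[q][k] for q in range(n)) for k in range(n)]


def top_k_indices(scores: list[float], budget: int) -> list[int]:
    """Keep the `budget` highest-scoring positions, returned in sequence order."""
    budget = max(budget, 0)
    # Sort by descending score; break ties in favour of earlier positions.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:budget])


# Toy causal attention matrix for a 3-token prompt (each row sums to 1).
attn = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.2, 0.1, 0.7],
]
scores = prefill_attention_scores(attn)
kept = top_k_indices(scores, budget=2)
```

In this toy case the first token accumulates the most attention, which is the kind of signal that makes the "keep the first few tokens" heuristic competitive. Note that summing over causal rows favours early positions simply because more queries can see them; a real implementation might normalize for that.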
Abstract
We investigate learned KV cache compression through Speculative Importance Prediction (SIP), a 1.7M-parameter non-query-aware scorer that predicts token importance from KV representations alone. Despite architectural sophistication (multi-horizon lookahead, cross-attention), SIP does not outperform simple baselines, including random selection, across 5 seeds, 4 retention levels, and 3 tasks. Key findings: (1) position-based heuristics (keep first 4 + last N tokens) match or exceed learned approaches; (2) prefill attention provides equivalent signal to complex learned scorers; (3) marginal information in KV representations beyond position and prefill attention appears limited for importance prediction. We hypothesize that circular dependence between future queries and generation trajectories contributes to this difficulty.
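The position-based heuristic named in the abstract (keep first 4 + last N tokens) can be sketched in a few lines. This is an illustrative sketch rather than the paper's code: the function name `position_keep_indices`, the `budget` parameter (total number of cache entries to retain), and the handling of edge cases are assumptions.

```python
def position_keep_indices(seq_len: int, budget: int, n_first: int = 4) -> list[int]:
    """Return the cache positions to keep: the first n_first tokens
    plus the most recent tokens, up to `budget` entries in total."""
    if budget <= 0:
        return []
    if budget >= seq_len:
        # Nothing needs to be evicted.
        return list(range(seq_len))
    keep_first = min(n_first, budget)
    keep_last = budget - keep_first  # remaining budget goes to recent tokens
    first = list(range(keep_first))
    last = list(range(seq_len - keep_last, seq_len))
    return first + last


# Example: a 10-token cache compressed to 6 entries.
kept = position_keep_indices(seq_len=10, budget=6)
# kept == [0, 1, 2, 3, 8, 9]
```

The selection depends only on sequence length and budget, so it needs no model forward pass and no access to the KV contents, which is what makes it a demanding baseline for a learned scorer to beat.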
Taxonomy
Topics: Caching and Content Delivery · Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques
