Identify Critical KV Cache in LLM Inference from an Output Perturbation Perspective
Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S Kevin Zhou

TL;DR
This paper introduces a formal method to identify critical KV cache entries in large language models by analyzing output perturbations, leading to more effective cache pruning and improved inference efficiency.
Contribution
It provides a formal analysis of attention output perturbation to identify critical KV entries, surpassing empirical methods and enhancing cache eviction strategies.
Findings
The proposed algorithm outperforms existing cache eviction methods.
It achieves lower output perturbation than those methods in over 92% of attention heads.
It improves inference efficiency in Llama models.
Abstract
Large language models have revolutionized natural language processing but face significant challenges of high storage and runtime costs, due to the transformer architecture's reliance on self-attention, particularly the large Key-Value (KV) cache for long-sequence inference. Recent efforts to reduce KV cache size by pruning less critical entries based on attention weights remain empirical and lack formal grounding. This paper presents a formal study on identifying critical KV cache entries by analyzing attention output perturbation. Our analysis reveals that, beyond attention weights, the value states within KV entries and pretrained parameter matrices are also crucial. Based on this, we propose a perturbation-constrained selection algorithm that optimizes the worst-case output perturbation to identify critical entries. Evaluations on the Needle-in-a-Haystack test and Longbench…
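The abstract's central point, that value states and the output projection matter alongside attention weights, can be illustrated with a small numerical sketch. The code below is not the paper's selection algorithm. It is a simplified illustration under stated assumptions: a single head and a single query, random synthetic tensors, an L1 distance as the perturbation measure, and a hypothetical scoring rule (attention weight times the L1 norm of the projected value row). All function names and sizes are invented for the example.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def output_perturbation(weights, values, w_o, keep):
    # L1 distance between the full attention output and the output obtained
    # when only the cache entries in `keep` survive (weights renormalized).
    full = weights @ values @ w_o
    kept_w = weights[keep] / weights[keep].sum()
    pruned = kept_w @ values[keep] @ w_o
    return np.abs(full - pruned).sum()

def select_by_weight(weights, budget):
    # Baseline: keep the `budget` entries with the largest attention weights.
    return np.sort(np.argsort(weights)[-budget:])

def select_value_aware(weights, values, w_o, budget):
    # Hypothetical score (an assumption, not the paper's criterion):
    # attention weight times the L1 norm of the projected value row.
    scores = weights * np.abs(values @ w_o).sum(axis=1)
    return np.sort(np.argsort(scores)[-budget:])

# Synthetic data: n cached entries, one head, one query.
rng = np.random.default_rng(0)
n, d_head, d_model = 64, 16, 32
weights = softmax(rng.normal(size=n) * 2.0)
# Value rows with widely varying magnitudes, so weights alone are misleading.
values = rng.normal(size=(n, d_head)) * rng.uniform(0.1, 3.0, size=(n, 1))
w_o = rng.normal(size=(d_head, d_model)) / np.sqrt(d_head)

budget = 16
selections = {
    "attention weight only": select_by_weight(weights, budget),
    "weight x projected value norm": select_value_aware(weights, values, w_o, budget),
}
for name, keep in selections.items():
    err = output_perturbation(weights, values, w_o, keep)
    print(f"{name}: L1 output perturbation = {err:.4f}")
```

Which selection yields the smaller perturbation depends on the data and the budget, so the sketch prints both values rather than asserting a winner. It is meant only to show how the perturbation of the attention output can be measured for a given eviction choice.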
Taxonomy
Topics: Power System Optimization and Stability · Power Systems and Technologies · Network Packet Processing and Optimization
Methods: Softmax · Attention Is All You Need · Pruning · LLaMA
