Identify Critical KV Cache in LLM Inference from an Output Perturbation Perspective

Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou

arXiv:2502.03805·cs.CL·February 7, 2025

TL;DR

This paper introduces a formal method to identify critical KV cache entries in large language models by analyzing output perturbations, leading to more effective cache pruning and improved inference efficiency.

Contribution

The paper provides a formal analysis of attention output perturbation to identify critical KV cache entries, replacing purely empirical, attention-weight-based criteria and improving cache eviction strategies.

Findings

01 The proposed algorithm outperforms existing cache eviction methods.

02 It achieves lower output perturbation in over 92% of attention heads.

03 It improves inference efficiency in Llama models.

Abstract

Large language models have revolutionized natural language processing but face significant challenges of high storage and runtime costs, due to the transformer architecture's reliance on self-attention, particularly the large Key-Value (KV) cache for long-sequence inference. Recent efforts to reduce KV cache size by pruning less critical entries based on attention weights remain empirical and lack formal grounding. This paper presents a formal study on identifying critical KV cache entries by analyzing attention output perturbation. Our analysis reveals that, beyond attention weights, the value states within KV entries and pretrained parameter matrices are also crucial. Based on this, we propose a perturbation-constrained selection algorithm that optimizes the worst-case output perturbation to identify critical entries. Evaluations on the Needle-in-a-Haystack test and Longbench…
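To make the abstract's central observation concrete, the sketch below measures how much a single head's attention output changes when KV entries are evicted, and compares two ways of choosing which entries to keep. It is a minimal illustration and not the paper's perturbation-constrained algorithm: the function names, the L1 distance, and the score `weight * ||v_i W_O||_1` are assumptions chosen only to show why value states and the output projection matter alongside attention weights.

```python
import numpy as np

def attention_output(weights, values, w_o, keep=None):
    """Output of one head for one query; optionally restricted to kept KV entries."""
    if keep is not None:
        # Softmax over the surviving entries equals the original weights renormalized.
        weights = weights[keep] / weights[keep].sum()
        values = values[keep]
    return weights @ values @ w_o

def perturbation(weights, values, w_o, keep):
    """L1 distance between the full-cache output and the pruned-cache output."""
    full = attention_output(weights, values, w_o)
    pruned = attention_output(weights, values, w_o, keep)
    return float(np.abs(full - pruned).sum())

def top_k_by_weight(weights, k):
    """Baseline (assumed): keep the k entries with the largest attention weights."""
    return np.sort(np.argsort(-weights)[:k])

def top_k_by_weighted_value_norm(weights, values, w_o, k):
    """Illustrative score: attention weight times the L1 norm of the projected value."""
    scores = weights * np.abs(values @ w_o).sum(axis=1)
    return np.sort(np.argsort(-scores)[:k])

# Hypothetical toy data: 64 cached entries, head dimension 16, model dimension 32.
rng = np.random.default_rng(0)
n, d_head, d_model, budget = 64, 16, 32, 16
logits = rng.normal(size=n)
weights = np.exp(logits - logits.max())
weights /= weights.sum()
# Give the value vectors uneven magnitudes so the two criteria can disagree.
values = rng.normal(size=(n, d_head)) * rng.uniform(0.1, 3.0, size=(n, 1))
w_o = rng.normal(size=(d_head, d_model))

keep_a = top_k_by_weight(weights, budget)
keep_b = top_k_by_weighted_value_norm(weights, values, w_o, budget)
print("perturbation, attention weights only:", perturbation(weights, values, w_o, keep_a))
print("perturbation, weights and projected values:", perturbation(weights, values, w_o, keep_b))
```

Which criterion yields the smaller perturbation depends on the data; the sketch only provides a way to measure the quantity the abstract refers to, not a reproduction of the reported results.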

Taxonomy

Topics: Power System Optimization and Stability · Power Systems and Technologies · Network Packet Processing and Optimization

Methods: Softmax · Attention Is All You Need · Pruning · LLaMA