Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics

Samhruth Ananthanarayanan; Ayan Sengupta; Tanmoy Chakraborty

arXiv:2603.01426·cs.CL·March 3, 2026

Open Access

TL;DR

This paper investigates how key-value cache compression affects large language models by analyzing attention dynamics, revealing structural properties, redundancy, and phase transitions that influence model robustness and scalability.

Contribution

It introduces a physics-inspired framework to understand KV compression as a perturbation of attention routing, uncovering structural insights and resilience profiles across architectures.
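To make the phrase "perturbation of attention routing" concrete, the following toy sketch shows how evicting keys changes where a single query's attention can go. It is not the paper's framework: the scores, the helper names (`softmax`, `evict_lowest`), and the score-based eviction rule are all assumptions chosen for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of raw attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def evict_lowest(scores, keep):
    # Hypothetical eviction policy: keep the `keep` highest-scoring positions.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:keep])

# Made-up query-key scores for six cached tokens (illustrative only).
scores = [2.0, 0.1, 0.0, 1.5, -1.0, 0.3]

full_weights = softmax(scores)
kept = evict_lowest(scores, keep=3)

# Share of the original attention mass that pointed at surviving tokens.
retained_mass = sum(full_weights[i] for i in kept)

# After eviction the model renormalizes over what is left, so the
# surviving tokens absorb the weight that the evicted tokens used to receive.
compressed_weights = softmax([scores[i] for i in kept])

print("kept positions:", kept)
print("retained attention mass:", round(retained_mass, 3))
print("renormalized weights:", [round(w, 3) for w in compressed_weights])
```

In this toy setting a token can be kept in the cache and still receive a different share of attention than it did before compression, which is one simple way to picture the gap between retaining a KV pair and routing to it.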

Findings

01. Moderate compression degrades internal representations but causes little accuracy loss, revealing redundancy.

02. All models show a sharp hallucination safety cliff near 90% compression, correlated with spikes in the Global Eviction Ratio (GER) and suggestive of a phase transition.

03. Different architectures exhibit distinct routing dynamics and resilience profiles.

Abstract

As context windows in LLMs scale to 100K+ tokens, the key-value (KV) cache becomes the dominant memory bottleneck, with recent methods claiming 80-90% savings and minimal benchmark degradation. We argue these evaluations miss a structural issue: attention is not just storage but routing, and retaining KV pairs does not guarantee semantic accessibility. We propose a physics-inspired view of KV compression as a controlled perturbation of token-level routing, distinguishing retention, accessibility, and utilization. Using synthetic tasks probing multi-entity tracking, disambiguation, coreference, and multi-hop reasoning, we find that moderate compression degrades internal representations with little accuracy loss, revealing redundancy; all models exhibit a sharp hallucination safety cliff near 90% compression, correlated with spikes in Global Eviction Ratio (GER), suggesting a phase…
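The memory claim in the abstract can be checked with back-of-the-envelope arithmetic. The sketch below is only an illustration: the model configuration (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is hypothetical and not taken from the paper, and compression is modeled simply as evicting a fixed percentage of tokens from every layer and head.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2, batch=1):
    # Factor of 2: one tensor for keys and one for values in each layer.
    return 2 * batch * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def tokens_kept(seq_len, compression_pct):
    # Integer arithmetic avoids float rounding (1 - 0.9 is not exactly 0.1).
    return seq_len * (100 - compression_pct) // 100

# Hypothetical configuration, chosen only for illustration.
CONFIG = dict(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2)
SEQ_LEN = 100_000

full_bytes = kv_cache_bytes(SEQ_LEN, **CONFIG)
compressed_bytes = kv_cache_bytes(tokens_kept(SEQ_LEN, 90), **CONFIG)

print(f"full cache:      {full_bytes / 2**30:.2f} GiB")
print(f"90% compression: {compressed_bytes / 2**30:.2f} GiB")
```

Under these assumptions a single 100K-token sequence needs about 12.21 GiB of cache, and evicting 90% of tokens brings that down to about 1.22 GiB. The saving scales linearly with the number of tokens kept, which is why the retained tokens have to carry all of the routing that the full cache used to support.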

Taxonomy

Topics: Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques · Distributed systems and fault tolerance