Linear Predictability of Attention Heads in Large Language Models
Khalid Shaikh, Asmit Kumar Singh, Rebecca Christopher Dsouza, Shikhar Shiromani

TL;DR
This paper reveals a linear structure in attention-head activations of large language models, enabling efficient KV-cache reconstruction and reducing inference bottlenecks.
Contribution
The paper identifies a learned linear predictability among attention heads in pretrained Transformers and demonstrates practical KV-cache reduction techniques.
Findings
High-fidelity head prediction with 2-5 reference heads
Linear predictability emerges during pretraining
Achieves 2x KV-cache reduction with modest accuracy trade-offs
Abstract
Large language model (LLM) inference is increasingly bottlenecked by the Key-Value (KV) cache, yet the fine-grained structure of attention-head activations remains poorly understood. We show that pretrained Transformers exhibit a pervasive inter-head linear structure: for a given token, the Query, Key, and Value (QKV) vectors of an attention head can often be reconstructed as a linear combination of a small number of peer heads, typically within the same layer. Across Llama-3.1-8B, Falcon3-10B, OLMo-2-7B, and Qwen3-32B, just 2-5 reference heads recover many target heads with high fidelity (e.g., mean R^2 ≈ 0.76 for Keys on C4 with five references, and frequently R^2 > 0.85 on GSM8K). This predictability is learned rather than architectural: it is largely absent at random initialization, rises rapidly during pretraining as we track through OLMo-2 checkpoints, and is supported by a…
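To make the notion of inter-head linear predictability concrete, the following is a minimal sketch on synthetic data, not the authors' implementation. It assumes one plausible parameterization that this excerpt does not confirm: a dense linear map from the concatenated reference-head vectors to the target head, fit by ordinary least squares and scored by held-out R^2. The function names, the head dimension, and the noise level are all hypothetical.

```python
import numpy as np

def fit_head_predictor(refs, target):
    """Least-squares map from reference-head vectors to a target head.

    refs:   (n_tokens, k * d) reference-head vectors, concatenated per token
    target: (n_tokens, d) vectors of the head being predicted
    Assumption: a dense linear map with no bias term.
    """
    weights, *_ = np.linalg.lstsq(refs, target, rcond=None)
    return weights  # shape (k * d, d)

def r_squared(refs, target, weights):
    """Coefficient of determination, pooled over all output dimensions."""
    pred = refs @ weights
    ss_res = np.sum((target - pred) ** 2)
    ss_tot = np.sum((target - target.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in for real activations (hypothetical sizes).
rng = np.random.default_rng(0)
n_tokens, d, k = 2000, 16, 3           # tokens, head dim, reference heads
refs = rng.normal(size=(n_tokens, k * d))

# A target head that really is a noisy linear function of its peers.
true_map = rng.normal(size=(k * d, d))
linear_target = refs @ true_map + 0.1 * rng.normal(size=(n_tokens, d))

# A control head that is statistically unrelated to the reference heads.
control_target = rng.normal(size=(n_tokens, d))

# Fit on the first half of the tokens, evaluate on the held-out half.
half = n_tokens // 2
w_linear = fit_head_predictor(refs[:half], linear_target[:half])
w_control = fit_head_predictor(refs[:half], control_target[:half])

r2_linear = r_squared(refs[half:], linear_target[half:], w_linear)
r2_control = r_squared(refs[half:], control_target[half:], w_control)

print(f"held-out R^2, linearly related head: {r2_linear:.3f}")
print(f"held-out R^2, unrelated control head: {r2_control:.3f}")
```

Under the same assumption, a cache that stores only the reference heads could recompute a predicted head as `refs @ weights` when it is needed, which is the general idea behind the KV-cache reduction summarized above. Real activations would replace the synthetic arrays here.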
Taxonomy
Topics: Big Data and Digital Economy · Ferroelectric and Negative Capacitance Devices · Advanced Neural Network Applications
