Linear Predictability of Attention Heads in Large Language Models

Khalid Shaikh; Asmit Kumar Singh; Rebecca Christopher Dsouza; Shikhar Shiromani

arXiv:2603.13314·cs.LG·March 17, 2026

TL;DR

This paper reveals a linear structure in attention-head activations of large language models, enabling efficient KV-cache reconstruction and reducing inference bottlenecks.

Contribution

The paper uncovers learned linear predictability among attention heads in pretrained Transformers and demonstrates practical KV-cache reduction techniques built on it.

Findings

01. High-fidelity head prediction with 2-5 reference heads

02. Linear predictability emerges during pretraining

03. Achieves 2x KV-cache reduction with modest accuracy trade-offs

Abstract

Large language model (LLM) inference is increasingly bottlenecked by the Key-Value (KV) cache, yet the fine-grained structure of attention-head activations remains poorly understood. We show that pretrained Transformers exhibit a pervasive inter-head linear structure: for a given token, the Query, Key, and Value (QKV) vectors of an attention head can often be reconstructed as a linear combination of a small number of peer heads, typically within the same layer. Across Llama-3.1-8B, Falcon3-10B, OLMo-2-7B, and Qwen3-32B, just 2-5 reference heads recover many target heads with high fidelity (e.g., mean R^2 ≈ 0.76 for Keys on C4 with five references, and frequently R^2 > 0.85 on GSM8K). This predictability is learned rather than architectural: it is largely absent at random initialization, rises rapidly during pretraining as we track through OLMo-2 checkpoints, and is supported by a…
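
The reconstruction idea in the abstract can be illustrated with a small least-squares sketch. This is not the authors' code: the data is synthetic, the function names `fit_linear_map` and `r_squared` are hypothetical, and the choice of a full linear map over the concatenated reference-head vectors (with no intercept) is an assumption, since the abstract only says "linear combination". The sketch fits the map on some tokens and reports R^2 on held-out tokens.

```python
import numpy as np

def fit_linear_map(ref_heads, target_head):
    """Least-squares map from concatenated reference-head vectors to a target head.

    ref_heads:   (n_tokens, n_refs, d_head) per-token vectors of the reference heads
    target_head: (n_tokens, d_head) per-token vectors of the head to reconstruct
    Returns W with shape (n_refs * d_head, d_head). No intercept is fitted (assumption).
    """
    X = ref_heads.reshape(ref_heads.shape[0], -1)
    W, *_ = np.linalg.lstsq(X, target_head, rcond=None)
    return W

def r_squared(ref_heads, target_head, W):
    """Coefficient of determination of the reconstruction, pooled over dimensions."""
    X = ref_heads.reshape(ref_heads.shape[0], -1)
    residual = target_head - X @ W
    centered = target_head - target_head.mean(axis=0, keepdims=True)
    return 1.0 - (residual ** 2).sum() / (centered ** 2).sum()

# Synthetic stand-in for real activations: the target head is constructed to be
# a noisy linear function of three reference heads, so a high R^2 is expected.
rng = np.random.default_rng(0)
n_tokens, n_refs, d_head = 2000, 3, 16
refs = rng.standard_normal((n_tokens, n_refs, d_head))
true_W = rng.standard_normal((n_refs * d_head, d_head))
noise = 0.1 * rng.standard_normal((n_tokens, d_head))
target = refs.reshape(n_tokens, -1) @ true_W + noise

# Fit on the first 1500 tokens, evaluate on the remaining 500.
train, test = slice(0, 1500), slice(1500, None)
W = fit_linear_map(refs[train], target[train])
score = r_squared(refs[test], target[test], W)
print(f"held-out R^2: {score:.3f}")
```

With real models, `refs` and `target` would be replaced by Key (or Query, Value) activations collected from the chosen heads. Evaluating on held-out tokens matters because the map has `n_refs * d_head * d_head` free parameters and can overfit a small sample.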

Taxonomy

Topics: Big Data and Digital Economy · Ferroelectric and Negative Capacitance Devices · Advanced Neural Network Applications