Retrieval Head Mechanistically Explains Long-Context Factuality
Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, Yao Fu

TL;DR
This paper identifies and characterizes special attention heads in long-context transformer models, called retrieval heads, which are crucial for retrieving relevant information and influence reasoning and hallucination.
Contribution
It systematically reveals the properties and importance of retrieval heads in long-context models, providing mechanistic insights into information retrieval within transformers.
Findings
Retrieval heads are universal, sparse, intrinsic, and dynamically activated.
Pruning retrieval heads impairs information retrieval and increases hallucination.
Retrieval heads significantly impact chain-of-thought reasoning tasks.
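The pruning finding can be pictured as head ablation: the output of selected heads is zeroed before the per-head outputs are concatenated. The sketch below is a toy illustration in plain Python, not the authors' code; the function names `mask_heads` and `concat` and the list-of-vectors layout are assumptions made for this example.

```python
def mask_heads(head_outputs, pruned):
    """Zero the output of every head whose index is in `pruned`.

    head_outputs: list of per-head output vectors (lists of floats) for one token.
    pruned: set of head indices to ablate (hypothetical choice of heads).
    """
    return [
        [0.0] * len(vec) if i in pruned else list(vec)
        for i, vec in enumerate(head_outputs)
    ]


def concat(head_outputs):
    """Concatenate per-head vectors, as is done before the output projection."""
    return [x for vec in head_outputs for x in vec]


# Toy example: two heads with two-dimensional outputs; head 1 is ablated.
outputs = [[1.0, 2.0], [3.0, 4.0]]
ablated = concat(mask_heads(outputs, pruned={1}))
```

Comparing task accuracy with a set of high-scoring heads masked against a same-sized random set of masked heads is one way such an ablation could be evaluated.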
Abstract
Despite the recent progress in long-context language models, it remains elusive how transformer-based models exhibit the capability to retrieve relevant information from arbitrary locations within the long context. This paper aims to address this question. Our systematic investigation across a wide spectrum of models reveals that a special type of attention heads are largely responsible for retrieving information, which we dub retrieval heads. We identify intriguing properties of retrieval heads: (1) universal: all the explored models with long-context capability have a set of retrieval heads; (2) sparse: only a small portion (less than 5%) of the attention heads are retrieval; (3) intrinsic: retrieval heads already exist in models pretrained with short context. When extending the context length by continual pretraining, it is still the same set of heads that perform information…
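To make the idea of scoring a head for retrieval concrete, here is a minimal sketch under a stated assumption: a head counts as retrieving a needle token when, at the decoding step that emits that token, the head's most-attended context position lies inside the needle and holds the same token. The score is the fraction of distinct needle tokens retrieved this way. The names `retrieval_score` and `select_retrieval_heads`, the data layout, and the 0.5 threshold are all hypothetical and chosen for illustration only.

```python
def retrieval_score(attn_rows, context_tokens, generated_tokens, needle_span):
    """Fraction of distinct needle tokens that one head 'copies'.

    attn_rows[t][j]: this head's attention weight at decoding step t
                     on context position j.
    context_tokens:  tokens of the long context (haystack plus needle).
    generated_tokens[t]: token emitted at decoding step t.
    needle_span: (start, end) half-open range of needle positions.
    """
    start, end = needle_span
    needle = set(context_tokens[start:end])
    copied = set()
    for row, tok in zip(attn_rows, generated_tokens):
        # Position receiving the largest attention weight at this step.
        top = max(range(len(row)), key=row.__getitem__)
        if start <= top < end and context_tokens[top] == tok:
            copied.add(tok)
    return len(copied) / len(needle) if needle else 0.0


def select_retrieval_heads(scores, threshold=0.5):
    """Return the (layer, head) keys whose score meets the assumed threshold."""
    return sorted(h for h, s in scores.items() if s >= threshold)
```

With scores collected for every (layer, head) pair, the share of heads returned by `select_retrieval_heads` gives a rough sparsity figure to compare against the "less than 5%" claim in the abstract.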
Peer Reviews
Decision: ICLR 2025 Oral
- The paper is well-written, with a clear overarching research question (“How do transformer-based language models acquire long-context capabilities?”), substantial findings on the existence of “retrieval heads” and their properties, and experiments to support each finding.
- The experiments are conducted extensively on the LLaMA, Yi, Qwen, and Mistral model families, at scales from 6B to 8x7B, on base and chat models, and on dense and Mixture-of-Experts models.
- To identify which at
- It would help readers to clarify in L146 that "$k$ is a sentence that is irrelevant to $x$", rather than, for example, a short phrase or a single word. A reference to Figure 2 would give readers an example.
- The paper is missing dataset details (L195, L355, L427). Are the NIAH samples created manually or by prompting large language models? Which datasets are used in Sec 2-4? What additional evaluation tests were created in Sec 4.1-4.3?
- The paper misses experimental details,
Overall, the paper is well-structured, the assumptions are clear, and, with a few exceptions listed in the "weaknesses", the methodology is clear. The results and experiments robustly support the authors’ claims.
The main concern with the paper is the lack of details on the benchmark used to generate the needle-and-haystack pairs. The paper does not clarify how these pairs are created, the diversity of pairs (e.g., in topic, token variety), or the validation methods used. Including these details would provide reviewers with valuable insights into the experimental design, helping them better assess the generalizability and universality of the findings. Additional minor points: 1) The acronym "KV" is not
1. Provides in-depth analysis and extensive experiments on various properties of retrieval heads.
2. Well-written, with clear graphics, easy-to-follow explanations, and a well-organized structure.
3. Numerous examples and case studies effectively illustrate the properties, making the concepts easy to understand.
Nothing major needs to be addressed. Please address the discussion questions in the questions section.
Taxonomy
Topics: Topic Modeling · Memory Processes and Influences
Methods: Sparse Evolutionary Training · Pruning
