Retrieval Head Mechanistically Explains Long-Context Factuality

Wenhao Wu; Yizhong Wang; Guangxuan Xiao; Hao Peng; Yao Fu

arXiv:2404.15574·cs.CL·April 25, 2024·1 cites

TL;DR

This paper identifies and characterizes special attention heads in long-context transformer models, called retrieval heads, which are crucial for retrieving relevant information and influence reasoning and hallucination.

Contribution

It systematically reveals the properties and importance of retrieval heads in long-context models, providing mechanistic insights into information retrieval within transformers.

Findings

1. Retrieval heads are universal, sparse, intrinsic, and dynamically activated.
2. Pruning retrieval heads impairs information retrieval and increases hallucination.
3. Retrieval heads significantly impact chain-of-thought reasoning tasks.

Abstract

Despite the recent progress in long-context language models, it remains elusive how transformer-based models exhibit the capability to retrieve relevant information from arbitrary locations within the long context. This paper aims to address this question. Our systematic investigation across a wide spectrum of models reveals that a special type of attention head is largely responsible for retrieving information, which we dub retrieval heads. We identify intriguing properties of retrieval heads: (1) universal: all the explored models with long-context capability have a set of retrieval heads; (2) sparse: only a small portion (less than 5%) of the attention heads are retrieval heads; (3) intrinsic: retrieval heads already exist in models pretrained with short context. When extending the context length by continual pretraining, it is still the same set of heads that perform information…
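The abstract does not spell out how a head is judged to be a retrieval head, so the following is only a minimal sketch of one plausible scoring scheme, not the authors' implementation. It assumes we already have, for each decoding step, the context position a head attended to most strongly, and it counts how many needle tokens the head "copied" (attended to the same token that was generated). The names `retrieval_score`, `select_retrieval_heads`, the `(layer, head)` labelling, and the `0.1` threshold are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Head = Tuple[int, int]  # (layer index, head index); hypothetical labelling


def retrieval_score(
    top_attended: Sequence[int],
    context_tokens: Sequence[str],
    generated_tokens: Sequence[str],
    needle_span: range,
) -> float:
    """Fraction of needle positions that this head copied during decoding.

    top_attended[t] is the context position that received the head's highest
    attention weight while generated_tokens[t] was being produced (assumed
    to be extracted from the model beforehand).
    """
    copied = set()
    for step, pos in enumerate(top_attended):
        # A "copy" here means: the head looks inside the needle AND the token
        # it looks at is exactly the token being generated at this step.
        if pos in needle_span and context_tokens[pos] == generated_tokens[step]:
            copied.add(pos)
    return len(copied) / len(needle_span)


def select_retrieval_heads(
    scores: Dict[Head, float], threshold: float = 0.1
) -> List[Head]:
    """Heads whose score meets the assumed threshold, highest score first."""
    kept = [head for head, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda head: scores[head], reverse=True)


# Toy example: the needle "eat a sandwich" sits at context positions 5..7.
context = ["the", "sky", "is", "blue", "today", "eat", "a", "sandwich"]
generated = ["eat", "a", "sandwich"]
needle = range(5, 8)

scores = {
    (0, 0): retrieval_score([5, 6, 7], context, generated, needle),  # copies all
    (0, 1): retrieval_score([0, 6, 2], context, generated, needle),  # copies one
    (1, 0): retrieval_score([0, 1, 2], context, generated, needle),  # copies none
}
print(select_retrieval_heads(scores, threshold=0.5))
```

Under a scheme like this, the "sparse" property would correspond to only a small fraction of heads clearing the threshold; the exact criterion and threshold used in the paper should be taken from the paper itself.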

Peer Reviews

Decision·ICLR 2025 Oral

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- The paper is well-written, with a clear overarching research question (“How do transformer-based language models acquire long-context capabilities?”), substantial findings on the existence of “retrieval heads” and their properties, and experiments to support each finding.
- The experiments are extensively conducted on the LLaMA, Yi, Qwen, and Mistral model families, at various scales from 6B to 8x7B, on base and chat models, and on dense and Mixture-of-Experts models.
- To identify which at

Weaknesses

- It would facilitate reading to clarify in L146 that "$k$ is a sentence that is irrelevant to $x$", rather than, for example, a short phrase or a single word. A reference to Figure 2 could be added so that readers see an example.
- The paper lacks dataset details (L195, L355, L427). Are the NIAH samples created manually or by prompting large language models? What datasets are used to begin with in Sec 2-4? What additional evaluation tests did you create in Sec 4.1-4.3?
- The paper misses experimental details,

Reviewer 02 · Rating 8 · Confidence 3

Strengths

Overall, the paper is well-structured, the assumptions are clear, and, with a few exceptions listed in the "weaknesses", the methodology is clear. The results and experiments robustly support the authors’ claims.

Weaknesses

The main concern with the paper is the lack of details on the benchmark used to generate the needle-and-haystack pairs. The paper does not clarify how these pairs are created, the diversity of pairs (e.g., in topic, token variety), or the validation methods used. Including these details would provide reviewers with valuable insights into the experimental design, helping them better assess the generalizability and universality of the findings. Additional minor points: 1) The acronym "KV" is not

Reviewer 03 · Rating 8 · Confidence 2

Strengths

1. Provides in-depth analysis and extensive experiments on various properties of retrieval heads.
2. Well-written, with clear graphics, easy-to-follow explanations, and a well-organized structure.
3. Numerous examples and case studies effectively illustrate the properties, making the concepts easy to understand.

Weaknesses

Nothing major needs to be addressed. Please address the discussion questions in the Questions section.

Code & Models

Repositories

nightdessert/retrieval_head
PyTorch · Official

Datasets

shaswat123/NIAH
dataset · 7.6k dl

Taxonomy

Topics: Topic Modeling · Memory Processes and Influences

Methods: Sparse Evolutionary Training · Pruning