Understanding Synthetic Context Extension via Retrieval Heads
Xinyu Zhao, Fangcong Yin, Greg Durrett

TL;DR
This paper investigates how fine-tuning on synthetic data improves long-context understanding in language models, identifies the role retrieval heads play, and draws out implications for generating better synthetic data.
Contribution
It uncovers the relationship between synthetic data training and retrieval head development, offering mechanistic insights into long-context learning in LLMs.
Findings
Retrieval heads learned on synthetic data overlap with those from real data.
Retrieval head recall correlates strongly with downstream performance.
Retrieval heads are necessary but not sufficient for long-context tasks.
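The first two findings can be made concrete with a minimal sketch. It assumes each attention head already has a scalar retrieval score, and it reads "retrieval head recall" as the fraction of the real-data model's top-scoring heads that also rank among the synthetic-data model's top heads. That reading, the function names, and every number below are illustrative assumptions, not values or code from the paper.

```python
import math

def top_k_heads(scores, k):
    """Return the k (layer, head) pairs with the highest retrieval score."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def head_recall(synthetic_heads, real_heads):
    """Fraction of real-data retrieval heads also found after synthetic training."""
    if not real_heads:
        return 0.0
    return len(synthetic_heads & real_heads) / len(real_heads)

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-head retrieval scores, keyed by (layer, head).
real_scores = {(0, 1): 0.9, (1, 3): 0.7, (2, 0): 0.4, (2, 5): 0.1}
synthetic_scores = {(0, 1): 0.8, (1, 3): 0.2, (2, 0): 0.6, (2, 5): 0.3}

real_heads = top_k_heads(real_scores, k=2)
synthetic_heads = top_k_heads(synthetic_scores, k=2)
recall = head_recall(synthetic_heads, real_heads)

# Hypothetical recall and task-accuracy values for three synthetic datasets.
recalls = [0.25, 0.50, 0.75]
accuracies = [0.40, 0.55, 0.70]
r = pearson(recalls, accuracies)
```

In this toy setup `recall` is 0.5, because one of the two real-data heads reappears after synthetic training. A value of `r` near 1 across synthetic datasets would correspond to the strong correlation the findings describe.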
Abstract
Long-context LLMs are increasingly in demand for applications such as retrieval-augmented generation. To defray the cost of pretraining LLMs over long contexts, recent work takes an approach of synthetic context extension: fine-tuning LLMs with synthetically generated long-context data in a post-training stage. However, it remains unclear how and why this synthetic context extension imparts abilities for downstream long-context tasks. In this paper, we investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning. We vary the realism of "needle" concepts to be retrieved and diversity of the surrounding "haystack" context, from using LLMs to construct synthetic documents to using templated relations and creating symbolic datasets. We find that models trained on synthetic data fall short of the real data, but surprisingly, the mismatch can be…
Taxonomy
Topics: Video Analysis and Summarization · Context-Aware Activity Recognition Systems · Human Pose and Action Recognition
Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training
