Understanding In-Context Learning via Supportive Pretraining Data
Xiaochuang Han, Daniel Simig, Todor Mihaylov, Yulia Tsvetkov, Asli, Celikyilmaz, Tianlu Wang

TL;DR
This paper investigates the role of specific pretraining data in enabling in-context learning (ICL) in language models, revealing that targeted data supports ICL and can significantly improve model performance.
Contribution
It introduces an iterative, gradient-based method to identify pretraining data that supports ICL and demonstrates that training on this data enhances ICL ability.
Findings
Supportive pretraining data do not have higher domain relevance.
Supportive data contain more long-tail, rare tokens.
Supportive data are challenging examples with low information gain from long-range context.
Abstract
In-context learning (ICL) improves language models' performance on a variety of NLP tasks by simply demonstrating a handful of examples at inference time. It is not well understood why ICL ability emerges, as the model has never been specifically trained on such demonstrations. Unlike prior work that explores implicit mechanisms behind ICL, we study ICL via investigating the pretraining data. Specifically, we first adapt an iterative, gradient-based approach to find a small subset of pretraining data that supports ICL. We observe that a continued pretraining on this small subset significantly improves the model's ICL ability, by up to 18%. We then compare the supportive subset constrastively with random subsets of pretraining data and discover: (1) The supportive pretraining data to ICL do not have a higher domain relevance to downstream tasks. (2) The supportive pretraining data have a…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
