On the generalization of language models from in-context learning and finetuning: a controlled study

Andrew K. Lampinen; Arslan Chaudhry; Stephanie C.Y. Chan; Cody Wild; Diane Wan; Alex Ku; Jörg Bornschein; Razvan Pascanu; Murray Shanahan; James L. McClelland

arXiv:2505.00661 · cs.CL · November 12, 2025

TL;DR

This paper compares in-context learning (ICL) and fine-tuning in large language models. It finds that ICL often generalizes more flexibly in data-matched settings, and it proposes improving fine-tuning by adding in-context reasoning traces to the fine-tuning data.

Contribution

The study introduces novel datasets for evaluating generalization, and proposes a method to improve fine-tuning by incorporating in-context reasoning traces.

Findings

01. ICL can generalize inferences more flexibly than fine-tuning in data-matched settings

02. Adding in-context reasoning traces to fine-tuning data improves generalization

03. Fine-tuning can sometimes generalize to reversals within larger knowledge structures
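
The second finding can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes the in-context step is exposed as a callable that takes one training document and returns inferred statements. The names `augment_with_in_context_inferences`, `infer_in_context`, and `toy_reverser` are hypothetical, and `toy_reverser` is a rule-based stand-in for a language model prompted with the document in context.

```python
from typing import Callable, List


def augment_with_in_context_inferences(
    documents: List[str],
    infer_in_context: Callable[[str], List[str]],
) -> List[str]:
    """Return the original documents followed by new, de-duplicated inferences.

    `infer_in_context` is a hypothetical hook: in practice it would prompt a
    language model with the document in context and collect its inferences.
    """
    augmented = list(documents)
    seen = set(documents)
    for doc in documents:
        for inference in infer_in_context(doc):
            inference = inference.strip()
            # Skip empty outputs and statements already in the dataset.
            if inference and inference not in seen:
                seen.add(inference)
                augmented.append(inference)
    return augmented


def toy_reverser(doc: str) -> List[str]:
    """Stand-in for a model call: reverses 'A is the parent of B.' only."""
    marker = " is the parent of "
    if marker not in doc:
        return []
    parent, child = doc.rstrip(".").split(marker, 1)
    return [f"{child} is the child of {parent}."]


if __name__ == "__main__":
    train = ["Ana is the parent of Bo."]
    for line in augment_with_in_context_inferences(train, toy_reverser):
        print(line)
```

The augmented list would then serve as the fine-tuning set, so that reversed statements the model could only infer in context also appear as training text. Which inferences to generate and how to filter them are left open here.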

Abstract

Large language models exhibit exciting capabilities, yet can show surprisingly narrow generalization from fine-tuning. For example, they can fail to generalize to simple reversals of relations they are trained on, or fail to make simple logical deductions based on trained information. These failures to generalize factual information from fine-tuning can significantly hinder the reasoning capabilities of these models. On the other hand, language models' in-context learning (ICL) shows different inductive biases and deductive reasoning capabilities. Here, we explore these differences in generalization and deductive reasoning between in-context- and fine-tuning-based learning. To do so, we constructed several novel datasets to evaluate and improve models' abilities to make generalizations over factual information from novel data. These datasets are designed to create clean tests of generalization,…

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques