Re-examining learning linear functions in context
Omar Naim, Guilhem Fouilhé, Nicholas Asher

TL;DR
This paper investigates how transformer models learn linear functions in-context, revealing limitations in their ability to generalize and proposing a precise hypothesis of their learning process.
Contribution
It provides a controlled experimental analysis of in-context learning for linear functions, challenging assumptions about the models' algorithmic capabilities.
Findings
Transformers trained from scratch fail to generalize beyond their training distribution.
Models do not adopt simple algorithmic solutions like linear regression.
A mathematically precise hypothesis of the learned process is proposed.
Abstract
In-context learning (ICL) has emerged as a powerful paradigm for easily adapting Large Language Models (LLMs) to various tasks. However, our understanding of how ICL works remains limited. We explore a simple model of ICL in a controlled setup with synthetic training data to investigate ICL of univariate linear functions. We experiment with a range of GPT-2-like transformer models trained from scratch. Our findings challenge the prevailing narrative that transformers adopt algorithmic approaches like linear regression to learn a linear function in-context. These models fail to generalize beyond their training distribution, highlighting fundamental limitations in their capacity to infer abstract task structures. Our experiments lead us to propose a mathematically precise hypothesis of what the model might be learning.
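As a rough illustration of the setup the abstract describes, the sketch below samples one in-context prompt for a univariate linear function and computes the least-squares fit that a model implementing linear regression would reproduce. The form f(x) = a*x + b, the normal sampling, the prompt length, and the names `sample_prompt` and `least_squares_predict` are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def sample_prompt(rng, n_points=20, scale=1.0):
    """Sample one in-context prompt for a hypothetical task f(x) = a*x + b.

    `scale` is the standard deviation used for a, b, and x (an assumption);
    values above the training scale stand in for out-of-distribution prompts.
    """
    a, b = rng.normal(0.0, scale, size=2)
    x = rng.normal(0.0, scale, size=n_points)
    y = a * x + b
    return x, y, (a, b)

def least_squares_predict(x_ctx, y_ctx, x_query):
    """Baseline: fit a line to the context pairs and evaluate it at x_query."""
    design = np.column_stack([x_ctx, np.ones_like(x_ctx)])
    coef, *_ = np.linalg.lstsq(design, y_ctx, rcond=None)
    return coef[0] * x_query + coef[1]

rng = np.random.default_rng(0)
for scale in (1.0, 10.0):
    x, y, (a, b) = sample_prompt(rng, scale=scale)
    # Use all but the last pair as context and the last x as the query.
    pred = least_squares_predict(x[:-1], y[:-1], x[-1])
    print(f"scale={scale}: squared error = {(pred - y[-1]) ** 2:.2e}")
```

Because the context pairs are noiseless, this baseline recovers the line at any scale, so its error gives a reference point against which a trained transformer's out-of-distribution error could be compared.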
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
Understanding what these models learn even in the setting of linear regression can significantly enhance our understanding of their capabilities and limitations. Indeed, it has been observed that the models do not generalize to out-of-distribution samples, and thus it is unclear whether these models learn some type of algorithm.
1. The provided experimental study does not explain what these models are actually learning. For example, the models may be learning a tailor-made preconditioned gradient descent type of algorithm, with the preconditioning matrix being optimal for the in-distribution values and sub-optimal for out-of-distribution values. 2. It cannot be excluded that the current training methods are not optimal, since we know that these models do have the capability of representing these algori
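The reviewer's preconditioned-gradient-descent hypothesis can be made concrete with a small sketch. It assumes a slope-only task y = a*x, a single gradient step from w = 0, and a scalar preconditioner tuned to the training input scale; none of these choices, nor the name `one_step_preconditioned_gd`, come from the paper or the review.

```python
import numpy as np

def one_step_preconditioned_gd(x, y, precond):
    """One gradient step from w = 0 on mean squared error for y ~ w * x.

    The gradient of 0.5 * mean((w*x - y)**2) at w = 0 is -mean(x*y), so a
    step scaled by `precond` gives w = precond * mean(x*y).
    """
    return precond * np.mean(x * y)

rng = np.random.default_rng(0)
a = 2.0                         # hypothetical true slope
train_std = 1.0                 # assumed input scale seen during training
precond = 1.0 / train_std**2    # inverse of E[x^2] under the training inputs

for test_std in (1.0, 3.0):
    x = rng.normal(0.0, test_std, size=10_000)
    w = one_step_preconditioned_gd(x, a * x, precond)
    print(f"input std={test_std}: estimated slope = {w:.2f} (true {a})")
```

With inputs three times wider than the assumed training scale, the estimate is inflated by roughly the ratio of second moments (about 9), which is the kind of in-distribution-only behavior the reviewer describes.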
The paper conducts thorough experiments across various scales and settings, providing a comprehensive analysis of Transformer behavior.
1. The related work could benefit from a more comprehensive review. The paper primarily discusses the works of Garg et al., Akyürek et al., and Von Oswald et al. on regression for in-context learning (ICL), but there are additional relevant studies in this area that are not cited. A more thorough literature review, covering empirical and theoretical works on regression in ICL, would enhance the paper’s context. Checking recent citations in this line of research may help identify key studies to i
The paper has the following strengths: 1. Clear Motivation: The paper begins with a well-defined motivation, addressing gaps in the current understanding of in-context learning (ICL) in transformer models, especially for generalization. 2. Comprehensive Experiments: The experiments cover various transformer architectures and test them on various distributions.
The paper has the following weaknesses: 1. Clarify Terminology and Notation: The writing is a little poor. For example, in line 047, there should be a "." after "training data". Furthermore, the table should be in a more beautiful structure. 2. Explanation for the Problem: Although the paper provides various experiments, it should explain the failures of these models to generalize to data not in the training distribution. 3. Novelty: The paper provides robust experiments t
1. The paper focuses on an important and challenging problem: understanding the in-context ability of language models. 2. The writing is clear and easy to understand. 3. The authors provide code and detailed instructions for reproduction.
1. The models and empirical studies in the paper differ significantly from current large language models, potentially creating a gap between the claims and reality. 2. The findings of the paper have been previously proposed in other works. 3. The paper is missing some key references.
Taxonomy
Topics: Machine Learning and Algorithms · Evolutionary Algorithms and Applications
Methods: Linear Regression · ADaptive gradient method with the OPTimal convergence rate
