Re-examining learning linear functions in context

Omar Naim; Guilhem Fouilhé; Nicholas Asher

arXiv:2411.11465·cs.LG·September 3, 2025

Open Access · 4 Reviews

TL;DR

This paper investigates how transformer models learn linear functions in-context, revealing limitations in their ability to generalize and proposing a precise hypothesis of their learning process.

Contribution

It provides a controlled experimental analysis of in-context learning for linear functions, challenging assumptions about the models' algorithmic capabilities.

Findings

01

Transformers trained from scratch fail to generalize beyond their training distribution.

02

Models do not adopt simple algorithmic solutions like linear regression.

03

A mathematically precise hypothesis of the learned process is proposed.

Abstract

In-context learning (ICL) has emerged as a powerful paradigm for easily adapting Large Language Models (LLMs) to various tasks. However, our understanding of how ICL works remains limited. We explore a simple model of ICL in a controlled setup with synthetic training data to investigate ICL of univariate linear functions. We experiment with a range of GPT-2-like transformer models trained from scratch. Our findings challenge the prevailing narrative that transformers adopt algorithmic approaches like linear regression to learn a linear function in-context. These models fail to generalize beyond their training distribution, highlighting fundamental limitations in their capacity to infer abstract task structures. Our experiments lead us to propose a mathematically precise hypothesis of what the model might be learning.
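The setup described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the Gaussian sampling of coefficients and inputs, the scale values, and the prompt length are all assumptions made for the example. It builds in-context prompts for univariate linear functions f(x) = a*x + b and evaluates an ordinary least-squares baseline, which is the kind of algorithmic solution the paper argues the transformers do not adopt.

```python
import numpy as np

def sample_prompt(rng, n_points=20, coef_scale=1.0, x_scale=1.0):
    """Sample one task f(x) = a*x + b plus context points and a held-out query.

    The Gaussian distributions and scales are assumptions for illustration.
    """
    a, b = rng.normal(0.0, coef_scale, size=2)
    xs = rng.normal(0.0, x_scale, size=n_points + 1)
    ys = a * xs + b
    # The first n_points pairs form the context; the last pair is the query.
    return xs[:-1], ys[:-1], xs[-1], ys[-1]

def least_squares_predict(ctx_x, ctx_y, query_x):
    """Fit a line to the context by ordinary least squares, then predict."""
    design = np.stack([ctx_x, np.ones_like(ctx_x)], axis=1)
    (a_hat, b_hat), *_ = np.linalg.lstsq(design, ctx_y, rcond=None)
    return a_hat * query_x + b_hat

def mean_squared_error(rng, coef_scale, n_tasks=200):
    """Average squared query error of the baseline over many sampled tasks."""
    errors = []
    for _ in range(n_tasks):
        ctx_x, ctx_y, query_x, query_y = sample_prompt(rng, coef_scale=coef_scale)
        errors.append((least_squares_predict(ctx_x, ctx_y, query_x) - query_y) ** 2)
    return float(np.mean(errors))

rng = np.random.default_rng(0)
in_dist_error = mean_squared_error(rng, coef_scale=1.0)    # assumed training-like scale
out_dist_error = mean_squared_error(rng, coef_scale=10.0)  # assumed shifted scale
print(in_dist_error, out_dist_error)
```

Because least squares depends only on the context points, its error on noiseless data stays near zero at both scales. A model that had learned this algorithm would be expected to show the same insensitivity to the coefficient distribution, which is the behavior the paper reports the trained transformers lack.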

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 5 · Confidence 4

Strengths

Understanding what these models learn, even in the setting of linear regression, can significantly enhance our understanding of their capabilities and limitations. Indeed, it has been observed that the models do not generalize to out-of-distribution samples, and thus it is unclear whether these models learn some type of algorithm.

Weaknesses

1. The provided experimental study does not explain what these models are actually learning. For example, it may be that the models are learning a tailor-made preconditioned gradient descent type of algorithm, with the preconditioning matrix being optimal for in-distribution values and sub-optimal for out-of-distribution values. 2. It cannot be excluded that the current training methods are not optimal, since we know that these models do have the capability of representing these algori
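The first point above, a preconditioner tuned to the training distribution, can be made concrete with a toy sketch. Everything here is an assumption for illustration rather than something taken from the paper or the review: the bias-free model y = w*x, a single gradient step from w = 0, the scalar preconditioner 1 / E[x^2], and the input scales.

```python
import numpy as np

def one_step_preconditioned_gd(xs, ys, preconditioner):
    """One preconditioned gradient step on squared loss for y = w*x, from w = 0."""
    # Gradient of (1/2n) * sum (w*x - y)^2 with respect to w, evaluated at w = 0.
    grad_at_zero = -np.mean(xs * ys)
    return 0.0 - preconditioner * grad_at_zero

rng = np.random.default_rng(0)
true_w = 2.0
train_x_scale = 1.0
# Hypothetical fixed preconditioner: 1 / E[x^2] under the training input scale.
fixed_preconditioner = 1.0 / train_x_scale**2

def estimate(x_scale, n_points=1000):
    """Estimate w from noiseless points whose inputs have the given scale."""
    xs = rng.normal(0.0, x_scale, size=n_points)
    return one_step_preconditioned_gd(xs, true_w * xs, fixed_preconditioner)

w_in = estimate(x_scale=1.0)   # inputs match the scale the preconditioner was tuned for
w_out = estimate(x_scale=3.0)  # inputs are three times wider than assumed
print(w_in, w_out)
```

With matching input scale, the single step lands close to the true coefficient. With wider inputs, the same fixed preconditioner overshoots by roughly the ratio of the input variances, so an algorithm of this kind would look accurate in distribution and fail out of distribution.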

Reviewer 02 · Rating 1 · Confidence 4

Strengths

The paper conducts thorough experiments across various scales and settings, providing a comprehensive analysis of Transformer behavior.

Weaknesses

1. The related work could benefit from a more comprehensive review. The paper primarily discusses the works of Garg et al., Akyürek et al., and Von Oswald et al. on regression for in-context learning (ICL), but there are additional relevant studies in this area that are not cited. A more thorough literature review, covering empirical and theoretical works on regression in ICL, would enhance the paper’s context. Checking recent citations in this line of research may help identify key studies to i

Reviewer 03 · Rating 3 · Confidence 4

Strengths

The paper has the following strengths: 1. Clear Motivation: The paper begins with a well-defined motivation, addressing gaps in the current understanding of in-context learning (ICL) in transformer models, especially for generalization. 2. Comprehensive Experiments: The experiments cover various transformer architectures and test them on various distributions.

Weaknesses

The paper has the following weaknesses: 1. Clarify Terminology and Notation: The writing is a little poor. For example, in "line 047", there should be a "." after "training data". Furthermore, the table should be in a more beautiful structure. 2. Explanation for the Problem: Although the paper provides various experiments, it should explain the failures of these models to generalize to data not in the training distribution. 3. Novelty: The paper provides robust experiments t

Reviewer 04 · Rating 5 · Confidence 3

Strengths

1. The paper focuses on an important and challenging problem: understanding the in-context ability of language models. 2. The writing is clear and easy to understand. 3. The authors provide code and detailed instructions for reproduction.

Weaknesses

1. The models and empirical studies in the paper differ significantly from current large language models, potentially creating a gap between the claims and reality. 2. The findings of the paper have been previously proposed in other works. 3. The paper is missing some key references.

Taxonomy

Topics: Machine Learning and Algorithms · Evolutionary Algorithms and Applications

Methods: Linear Regression · ADaptive gradient method with the OPTimal convergence rate