Bayes' Power for Explaining In-Context Learning Generalizations

Samuel Müller; Noah Hollmann; Frank Hutter

arXiv:2410.01565·cs.LG·October 3, 2024

TL;DR

This paper proposes interpreting neural networks as approximations of the true data-generating posterior, explaining in-context learning and its generalization capabilities in large-scale models.

Contribution

It introduces a posterior-based interpretation of neural networks for in-context learning, providing insights into their generalization and limitations.

Findings

01

Models effectively compose knowledge from training data.

02

Surprising generalizations are explained by the true posterior.

03

Limitations of neural networks in approximating posteriors are identified.

Abstract

Traditionally, neural network training has been primarily viewed as an approximation of maximum likelihood estimation (MLE). This interpretation originated in a time when training for multiple epochs on small datasets was common and performance was data bound; but it falls short in the era of large-scale single-epoch trainings ushered in by large self-supervised setups, like language models. In this new setup, performance is compute-bound, but data is readily available. As models became more powerful, in-context learning (ICL), i.e., learning in a single forward-pass based on the context, emerged as one of the dominant paradigms. In this paper, we argue that a more useful interpretation of neural network behavior in this era is as an approximation of the true posterior, as defined by the data-generating process. We demonstrate this interpretation's power for ICL and its usefulness to…
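To make the "true posterior, as defined by the data-generating process" concrete, here is a minimal sketch of an exact posterior predictive for a toy in-context task. Everything in it is an illustrative assumption rather than the paper's setup: the latent is a step-function threshold `t` drawn uniformly from a small grid, labels are flipped with probability `noise`, and the function names are hypothetical. Under the abstract's interpretation, a well-trained in-context learner would approximate the quantity returned by `posterior_predictive`.

```python
# Hypothetical toy data-generating process (not taken from the paper):
# a latent threshold t defines y = 1 if x >= t else 0, and each label
# is flipped with probability `noise`.

def likelihood(t, x, y, noise):
    """p(y | x, t) for a noisy step function with threshold t."""
    clean = 1 if x >= t else 0
    return 1.0 - noise if y == clean else noise

def posterior_over_thresholds(thresholds, context, noise=0.1):
    """p(t | context) under a uniform prior over the candidate thresholds."""
    weights = []
    for t in thresholds:
        w = 1.0
        for x, y in context:
            w *= likelihood(t, x, y, noise)
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]

def posterior_predictive(thresholds, context, x_query, noise=0.1):
    """p(y = 1 | x_query, context), marginalising out the latent threshold."""
    post = posterior_over_thresholds(thresholds, context, noise)
    return sum(p * likelihood(t, x_query, 1, noise)
               for t, p in zip(thresholds, post))

# Assumed prior support: thresholds on a grid from 0.1 to 0.9.
thresholds = [0.1 * i for i in range(1, 10)]
# In-context examples (x, y) consistent with a step somewhere in (0.35, 0.65].
context = [(0.15, 0), (0.35, 0), (0.65, 1), (0.85, 1)]

posterior = posterior_over_thresholds(thresholds, context)
p_low = posterior_predictive(thresholds, context, 0.05)
p_high = posterior_predictive(thresholds, context, 0.95)
```

The prediction here is an average over all candidate latents weighted by how well each explains the context, which is the sense in which the term "posterior" is used for a predictive distribution over `y` rather than a distribution over model parameters.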

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 5

Strengths

The paper offers examples that illustrate how sequence models generate posterior predictions on latent factors unseen at training time. I thought the experiments were thoughtfully designed to highlight the basic Bayesian interpretation. While widely known in other contexts, the paper's discussion on the limitations of the posterior approximation interpretation is useful and adds color to the authors' narrative. I found the discussions on how PFN-style attention cannot capture repetitive patterns

Weaknesses

The paper provides interesting toy examples that suggest natural research questions to explore, but does not address any of them in depth. I truly appreciate the visualizations and the discussion in the submission. However, the paper currently reads as if it is a well-written report on toy simulations at the beginning of a research project, or a lecture note surveying well-known facts about sequence models and open research questions. It is unclear to me what the authors' main contributions are,

Reviewer 02 · Rating 3 · Confidence 4

Strengths

- The study of step functions as a simple class of functions for in-context learning is novel, as far as I know.

Weaknesses

- I am confused by the use of the term “posterior” in this paper. For someone with a background in Bayesian statistics the posterior is a distribution over model parameters, and I would refer to the p(x,y) in this paper as the true distribution and p(y|x) as the true conditional distribution, rather than the posterior. I think the authors are following the terminology of Xie et al, in which marginalising out the latent vector produces a kind of posterior predictive distribution (thinking of the

Reviewer 03 · Rating 5 · Confidence 3

Strengths

1. The experiments demonstrating the interpretative power of ICL are clear and compelling, with ideas presented in a straightforward and illustrative manner.
2. The authors also examine inherent constraints that lead to common failures, such as analyzing the X shift (out of support).

Weaknesses

1. Previous literature has raised questions regarding the Bayesian nature of ICL. For example, Raventós et al. showed that transformers pre-trained on data with low task diversity struggle to learn new tasks and identified a threshold beyond which ICL emerges. Numerous studies suggest that phase transitions occur with respect to both the diversity of the training data (the cardinality of $L$ here) and the context sequence length (the number of context tokens used both during training and inferen

Code & Models

Repositories

samuelgabriel/bayesgeneralizations
pytorch · Official

Taxonomy

Topics: Anomaly Detection Techniques and Applications · Time Series Analysis and Forecasting · Machine Learning and Data Classification