Competition Dynamics Shape Algorithmic Phases of In-Context Learning
Core Francisco Park, Ekdeep Singh Lubana, Itamar Pres, Hidenori Tanaka

TL;DR
This paper introduces a synthetic sequence modeling task to analyze in-context learning (ICL) in large language models, revealing that ICL behavior results from competition among multiple algorithms, with the dominant one depending on experimental conditions such as context size and amount of training.
Contribution
It proposes a unified synthetic framework for studying ICL, explaining model behavior through competing algorithms and mechanisms, and highlights the transient, mixture-based nature of ICL.
Findings
Models trained on the synthetic task replicate known ICL results.
Different algorithms dominate depending on context size and training, causing phase transitions.
ICL is better viewed as a mixture of algorithms rather than a single capability.
Abstract
In-Context Learning (ICL) has significantly expanded the general-purpose nature of large language models, allowing them to adapt to novel tasks using merely the inputted context. This has motivated a series of papers that analyze tractable synthetic domains and postulate precise mechanisms that may underlie ICL. However, the use of relatively distinct setups that often lack a sequence modeling nature to them makes it unclear how general the reported insights from such studies are. Motivated by this, we propose a synthetic sequence modeling task that involves learning to simulate a finite mixture of Markov chains. As we show, models trained on this task reproduce most well-known results on ICL, hence offering a unified setting for studying the concept. Building on this setup, we demonstrate we can explain a model's behavior by decomposing it into four broad algorithms that combine a…
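To make the task concrete, here is a minimal sketch of sampling from a finite mixture of Markov chains and comparing two candidate next-token predictors on the sampled context. It is not the authors' code. The way transition matrices are drawn, the sizes (`NUM_STATES`, `NUM_CHAINS`, `CONTEXT_LEN`), the add-`alpha` smoothing, and all function names are illustrative assumptions. The two predictors (an in-context bigram estimate and a Bayesian average over the known chains) are only examples of the kind of algorithms that could compete; the abstract's exact four algorithms are not spelled out here.

```python
import math
import random


def random_transition_matrix(num_states, rng):
    """Draw a row-stochastic matrix by normalizing strictly positive random rows.
    (Assumption: the paper may use a different prior over matrices.)"""
    matrix = []
    for _ in range(num_states):
        row = [rng.random() + 1e-3 for _ in range(num_states)]
        total = sum(row)
        matrix.append([p / total for p in row])
    return matrix


def sample_sequence(matrix, length, rng):
    """Sample a state sequence from one chain, starting from a uniform initial state."""
    num_states = len(matrix)
    state = rng.randrange(num_states)
    seq = [state]
    for _ in range(length - 1):
        state = rng.choices(range(num_states), weights=matrix[state])[0]
        seq.append(state)
    return seq


def bigram_predict(context, num_states, alpha=1.0):
    """In-context bigram estimate: count transitions out of the last state
    seen in the context, with add-alpha smoothing."""
    last = context[-1]
    counts = [alpha] * num_states
    for prev, nxt in zip(context, context[1:]):
        if prev == last:
            counts[nxt] += 1
    total = sum(counts)
    return [c / total for c in counts]


def bayes_predict(context, chains):
    """Posterior-weighted average over a known finite set of chains (uniform prior).
    The uniform initial-state term is identical for every chain, so it cancels."""
    log_liks = [
        sum(math.log(m[a][b]) for a, b in zip(context, context[1:]))
        for m in chains
    ]
    top = max(log_liks)  # subtract the max for numerical stability
    weights = [math.exp(ll - top) for ll in log_liks]
    norm = sum(weights)
    last = context[-1]
    num_states = len(chains[0])
    return [
        sum(w / norm * m[last][j] for w, m in zip(weights, chains))
        for j in range(num_states)
    ]


def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


rng = random.Random(0)
NUM_STATES, NUM_CHAINS, CONTEXT_LEN = 4, 3, 200  # hypothetical sizes
chains = [random_transition_matrix(NUM_STATES, rng) for _ in range(NUM_CHAINS)]
true_chain = chains[rng.randrange(NUM_CHAINS)]
context = sample_sequence(true_chain, CONTEXT_LEN, rng)

truth = true_chain[context[-1]]
bigram = bigram_predict(context, NUM_STATES)
bayes = bayes_predict(context, chains)
print("KL(truth || bigram):", kl(truth, bigram))
print("KL(truth || bayes): ", kl(truth, bayes))
```

Varying `CONTEXT_LEN` or `NUM_CHAINS` in this toy setup gives a feel for why one predictor might be preferred over another under different conditions, though the sketch says nothing about what a trained Transformer actually implements.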
Peer Reviews
Decision: ICLR 2025 Spotlight
- **Well-written**. This expository work is well written and enjoyable to read. The authors do a good job at holding the reader by the hand through their analyses and provide adequate context even for non-ICL experts. The figures are easily understandable (mostly, see below for a nit) and the flow of the paper makes sense.
- **Sound, novel, and unifying benchmark.** The benchmark is novel in the context of ICL studies and it is soundly derived (e.g., using the expected KL as quality measure in Eq.
- **Lack of actionable takeaway message**. Because the benchmark is synthetic, it’s unclear how much it says about ICL for real-world LLMs. This is the main flaw of the paper. For example, l.81: it’s good to know that Transformers can learn different algorithms — but which is it for LLMs / VLMs? Certainly not the unigram or bigram of l. 301, right? Similarly, I liked the flavor of paragraph on l. 301 but it’s unclear how to replicate this study on real-world LLMs. This limits the insights into h
1. Very detailed and sound study of the ICL mechanism: varying task diversity, training steps, context length, and evaluating on various metrics. This is a very empirically rigorous study that verifies previous literature by reproducing various well-known results within a single setting.
2. Some analysis on how model design affects downstream performance on various metrics.
My main concern with this paper is its novelty: as the authors have correctly noted, many of the results presented here have already been explored in existing literature. While this paper offers a valuable unifying study that synthesizes and reproduces previous findings in a single framework, the setting itself has also been examined in prior work (e.g., Edelman et al.). Consequently, the overall message lacks new insights.
I think this is a nice, comprehensive paper that introduces a simple setting that unifies many recent papers on ICL and captures analogous phenomena (task diversity thresholds, transience of ICL). For example, Figure 3 basically reproduces the results of two previous papers. The experimental protocols for assessing bigram utilization and proximity to the Bayesian solution were thoughtful and creative. I like the idea of approximating the transformer’s behavior as a mixture of algorithms. It’s
First of all, I want to emphasize that I think this line of work is valuable and scientifically meaningful. However, it would be nice to discuss how these insights translate into practical design choices (e.g., predicting OOD performance using a similar approach to this paper). I think one major premise of the paper is that there’s a transition between different algorithms. I think this is convincing. However, I think I want some kind of control condition where you use 3 or 4 different “silly
Taxonomy
Topics: Neural Networks and Applications · Advanced Vision and Imaging
