The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains
Benjamin L. Edelman, Ezra Edelman, Surbhi Goel, Eran Malach, Nikolaos Tsilivis

TL;DR
This paper investigates how transformers develop in-context learning abilities on a Markov chain sequence modeling task, revealing a multi-phase training process and the emergence of statistical induction heads that predict the next token from the bigram statistics of the context.
Contribution
It introduces a Markov Chain sequence modeling task to analyze in-context learning, providing empirical and theoretical insights into the phases of model training and the emergence of statistical induction heads.
Findings
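Models pass through multiple training phases: uniform predictions first, then sub-optimal in-context unigram predictions, then a rapid transition to the in-context bigram solution.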
Presence of unigram solutions may delay bigram solution formation.
Learning dynamics are influenced by the prior distribution over Markov chains.
Abstract
Large language models have the ability to generate text that mimics patterns in their inputs. We introduce a simple Markov Chain sequence modeling task in order to study how this in-context learning (ICL) capability emerges. In our setting, each example is sampled from a Markov chain drawn from a prior distribution over Markov chains. Transformers trained on this task form statistical induction heads which compute accurate next-token probabilities given the bigram statistics of the context. During the course of training, models pass through multiple phases: after an initial stage in which predictions are uniform, they learn to sub-optimally predict using in-context single-token statistics (unigrams); then, there is a rapid phase transition to the correct in-context bigram solution. We conduct an empirical and theoretical investigation of this multi-phase process, showing how…
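To make the task in the abstract concrete, here is a minimal sketch of the data-generating process and of the two in-context predictors the abstract contrasts (unigram and bigram). It is an illustration under stated assumptions, not the authors' code: the symmetric Dirichlet prior over transition-matrix rows, the uniform initial state, the add-one smoothing, and all function names (sample_chain, sample_sequence, unigram_predict, bigram_predict) are hypothetical choices made for this example.

```python
import numpy as np


def sample_chain(k, rng, alpha=1.0):
    """Draw a k-state transition matrix; each row ~ Dirichlet(alpha) (assumed prior)."""
    return rng.dirichlet(alpha * np.ones(k), size=k)


def sample_sequence(P, length, rng):
    """Sample a token sequence from the Markov chain with transition matrix P."""
    k = P.shape[0]
    seq = np.empty(length, dtype=np.int64)
    seq[0] = rng.integers(k)  # uniform initial state (assumption)
    for t in range(1, length):
        seq[t] = rng.choice(k, p=P[seq[t - 1]])
    return seq


def unigram_predict(seq, k, smoothing=1.0):
    """Next-token distribution from in-context single-token frequencies."""
    counts = np.bincount(seq, minlength=k) + float(smoothing)
    return counts / counts.sum()


def bigram_predict(seq, k, smoothing=1.0):
    """Next-token distribution from tokens that followed earlier
    occurrences of the current (last) token in the context."""
    counts = np.full(k, float(smoothing))
    prev, nxt = seq[:-1], seq[1:]
    counts += np.bincount(nxt[prev == seq[-1]], minlength=k)
    return counts / counts.sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k = 3
    P = sample_chain(k, rng)
    seq = sample_sequence(P, 200, rng)
    print("true row:   ", P[seq[-1]])
    print("bigram est.:", bigram_predict(seq, k))
    print("unigram est:", unigram_predict(seq, k))
```

With a long enough context, the bigram estimate should approach the true transition row for the current token, whereas the unigram estimate tracks overall token frequencies and ignores the current token, which is why the abstract calls the unigram solution sub-optimal.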
Taxonomy
Topics: Neural Networks and Applications
