What One Cannot, Two Can: Two-Layer Transformers Provably Represent Induction Heads on Any-Order Markov Chains

Chanakya Ekbote; Marco Bondaschi; Nived Rajaraman; Jason D. Lee; Michael Gastpar; Ashok Vardhan Makkuva; Paul Pu Liang

arXiv:2508.07208 · cs.LG · November 18, 2025

Open Access

TL;DR

This paper proves that a two-layer transformer with one attention head per layer can represent any conditional k-gram, demonstrating that shallow transformers have strong in-context learning capabilities for structured sequence tasks.

Contribution

The paper gives the first proof that a two-layer transformer can represent the conditional k-gram model for Markov sources of any order k, clarifying how model depth relates to the Markov order a transformer can handle in context.

Findings

01. Two-layer transformers can represent any conditional k-gram.

02. Training dynamics show that effective in-context representations emerge during learning.

03. The results deepen the understanding of how transformer depth relates to Markov order in ICL.

Abstract

In-context learning (ICL) is a hallmark capability of transformers, through which trained models learn to adapt to new tasks by leveraging information from the input context. Prior work has shown that ICL emerges in transformers due to the presence of special circuits called induction heads. Given the equivalence between induction heads and conditional k-grams, a recent line of work modeling sequential inputs as Markov processes has revealed the fundamental impact of model depth on its ICL capabilities: while a two-layer transformer can efficiently represent a conditional 1-gram model, its single-layer counterpart cannot solve the task unless it is exponentially large. However, for higher order Markov sources, the best known constructions require at least three layers (each with a single attention head) - leaving open the question: can a two-layer single-head transformer represent any…
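
For readers unfamiliar with the term, the conditional k-gram that the abstract equates with an induction head can be sketched as a simple count-based estimator: find every earlier position in the context whose preceding k symbols match the most recent k symbols, and tally what came next. The sketch below is only an illustration of that target function, not the paper's transformer construction; the function name `conditional_kgram`, the `smoothing` parameter, the uniform fallback for unseen contexts, and the assumption that symbols are integers in `range(vocab_size)` are all choices made here for illustration.

```python
from collections import Counter


def conditional_kgram(sequence, k, vocab_size, smoothing=0.0):
    """Estimate P(next symbol | last k symbols) from counts inside `sequence`.

    Assumes symbols are integers in range(vocab_size) (illustrative choice).
    """
    if k < 1 or len(sequence) < k:
        raise ValueError("need k >= 1 and at least k symbols of context")

    query = tuple(sequence[-k:])  # the k most recent symbols
    counts = Counter()

    # Visit every position t that has k symbols before it. If those k symbols
    # match the query, count the symbol at t (the "match, then copy" behaviour
    # associated with induction heads).
    for t in range(k, len(sequence)):
        if tuple(sequence[t - k:t]) == query:
            counts[sequence[t]] += 1

    total = sum(counts.values()) + smoothing * vocab_size
    if total == 0:
        # The current context never appeared earlier: fall back to uniform
        # (an assumption of this sketch).
        return [1.0 / vocab_size] * vocab_size
    return [(counts[s] + smoothing) / total for s in range(vocab_size)]


if __name__ == "__main__":
    # Order-2 example: the context (0, 1) was followed by 2 both times it appeared.
    print(conditional_kgram([0, 1, 2, 0, 1, 2, 0, 1], k=2, vocab_size=3))
```

With k = 1 this reduces to the conditional 1-gram case that the abstract attributes to earlier two-layer constructions; larger k only changes how many preceding symbols must match.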

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis