Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains

Ashok Vardhan Makkuva; Marco Bondaschi; Adway Girish; Alliot Nagle; Martin Jaggi; Hyeji Kim; Michael Gastpar

arXiv:2402.04161 · cs.LG · July 22, 2025 · 1 citation

Open Access · 1 repository

TL;DR

This paper introduces a Markov chain-based framework to analyze transformer models, explaining why single-layer transformers often fail to learn bigram distributions and how deeper models succeed, supported by theoretical and empirical evidence.

Contribution

The paper provides a novel Markov chain framework for analyzing transformers, characterizes the loss landscape of single-layer models, and explains their empirical failure to learn bigram distributions.

Findings

01. Single-layer transformers often get trapped in local minima representing unigram distributions.

02. Deeper transformers reliably learn the in-context bigram distribution.

03. Theoretical analysis matches empirical observations of model behavior.

Abstract

Attention-based transformers have achieved tremendous success across a variety of disciplines including natural languages. To deepen our understanding of their sequential modeling capabilities, there is a growing interest in using Markov input processes to study them. A key finding is that when trained on first-order Markov chains, transformers with two or more layers consistently develop an induction head mechanism to estimate the in-context bigram conditional distribution. In contrast, single-layer transformers, unable to form an induction head, directly learn the Markov kernel but often face a surprising challenge: they become trapped in local minima representing the unigram distribution, whereas deeper models reliably converge to the ground-truth bigram. While single-layer transformers can theoretically model first-order Markov chains, their empirical failure to learn this simple…
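The distinction the abstract draws between the unigram distribution and the in-context bigram conditional distribution can be illustrated with a small simulation. The sketch below is not taken from the paper or its repository: the binary state space, the switching probabilities `p` and `q`, and all function names are assumptions chosen for illustration. It samples a first-order Markov chain and compares a count-based unigram estimate (which ignores the previous symbol) with a count-based bigram estimate (which conditions on it).

```python
import random
from collections import Counter


def sample_markov_chain(p, q, length, seed=0):
    """Sample a binary first-order Markov chain (illustrative assumption).

    p = P(next = 1 | current = 0), q = P(next = 0 | current = 1).
    """
    rng = random.Random(seed)
    # Start from the stationary distribution, where P(x = 1) = p / (p + q).
    x = 1 if rng.random() < p / (p + q) else 0
    seq = [x]
    for _ in range(length - 1):
        switch_prob = p if x == 0 else q
        if rng.random() < switch_prob:
            x = 1 - x
        seq.append(x)
    return seq


def unigram_estimate(seq):
    """Empirical P(x = 1), ignoring the previous symbol."""
    return sum(seq) / len(seq)


def bigram_estimate(seq):
    """Empirical P(next = 1 | current = s) for s in {0, 1}, from counts."""
    counts = Counter(zip(seq[:-1], seq[1:]))
    est = {}
    for s in (0, 1):
        total = counts[(s, 0)] + counts[(s, 1)]
        est[s] = counts[(s, 1)] / total if total else float("nan")
    return est


# Hypothetical parameters: the true kernel has P(1 | 0) = 0.2, P(1 | 1) = 0.7.
seq = sample_markov_chain(p=0.2, q=0.3, length=20000, seed=0)
uni = unigram_estimate(seq)
bi = bigram_estimate(seq)
print("unigram P(1):", uni)
print("bigram P(1 | 0), P(1 | 1):", bi[0], bi[1])
```

Under these assumed parameters, the unigram estimate should approach the stationary probability p / (p + q) = 0.4 regardless of the current symbol, while the bigram estimate should approach the two distinct conditional probabilities 0.2 and 0.7. A predictor stuck at the unigram solution therefore outputs the same distribution after every symbol, which is the failure mode the abstract attributes to single-layer transformers.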

Code & Models

Repositories

bond1995/markov (PyTorch, official)

Taxonomy

Topics: Semantic Web and Ontologies