Transformers on Markov Data: Constant Depth Suffices
Nived Rajaraman, Marco Bondaschi, Kannan Ramchandran, Michael Gastpar, Ashok Vardhan Makkuva

TL;DR
This paper demonstrates that fixed-depth transformers can effectively model Markov processes, with theoretical and empirical evidence showing that shallow, attention-only transformers suffice to learn in-context distributions.
Contribution
It provides the first theoretical proof that a three-layer, single-head transformer can represent in-context distributions for Markov sources, supported by empirical findings.
Findings
Transformers with fixed depth achieve low test loss on Markov data.
A single-head, three-layer transformer can represent the in-context conditional empirical distribution for k-th order Markov sources.
Attention-only transformers with O(log k) layers can track the previous k symbols in the sequence.
Abstract
Attention-based transformers have been remarkably successful at modeling generative processes across various domains and modalities. In this paper, we study the behavior of transformers on data drawn from k-th order Markov processes, where the conditional distribution of the next symbol in a sequence depends on the previous k symbols observed. We observe a surprising phenomenon empirically which contradicts previous findings: when trained for sufficiently long, a transformer with a fixed depth and a single head per layer is able to achieve low test loss on sequences drawn from k-th order Markov sources, even as k grows. Furthermore, this low test loss is achieved by the transformer's ability to represent and learn the in-context conditional empirical distribution. On the theoretical side, our main result is that a transformer with a single head and three layers can represent the in-context…
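To make the "in-context conditional empirical distribution" concrete, here is a minimal Python sketch of the quantity itself, computed by direct counting rather than by a transformer. The function name, the integer symbol encoding, and the uniform fallback for unseen contexts are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter


def in_context_conditional_distribution(seq, k, vocab_size):
    """Estimate P(next symbol | last k symbols) from counts within seq itself.

    Hypothetical helper for illustration: symbols are assumed to be integers
    in range(vocab_size), and an unseen context falls back to uniform
    (an assumption, not the paper's choice).
    """
    if k < 0 or len(seq) < k:
        raise ValueError("need 0 <= k <= len(seq)")

    # The context is the final k symbols of the sequence.
    context = tuple(seq[len(seq) - k:])

    # Count which symbol followed each earlier occurrence of that context.
    counts = Counter()
    for i in range(len(seq) - k):
        if tuple(seq[i:i + k]) == context:
            counts[seq[i + k]] += 1

    total = sum(counts.values())
    if total == 0:
        # Context never appeared earlier: uniform fallback (assumption).
        return [1.0 / vocab_size] * vocab_size
    return [counts[s] / total for s in range(vocab_size)]


# Binary sequence, order k = 2: the context (0, 1) appeared twice before,
# followed once by 0 and once by 1.
print(in_context_conditional_distribution([0, 1, 0, 1, 1, 0, 1], k=2, vocab_size=2))
```

The counting loop looks back over the whole sequence for matches of the last k symbols, which is the same bookkeeping the abstract attributes to the transformer: identify the previous k symbols, find earlier positions with the same context, and average the symbols that followed.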
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Markov Chains and Monte Carlo Methods
