Transformers on Markov Data: Constant Depth Suffices

Nived Rajaraman; Marco Bondaschi; Kannan Ramchandran; Michael Gastpar; Ashok Vardhan Makkuva

arXiv:2407.17686 · cs.LG · July 26, 2024

TL;DR

This paper shows that fixed-depth transformers can model $k^{\text{th}}$-order Markov processes: empirically, they reach low test loss even as $k$ grows, and theoretically, shallow transformers suffice to represent the in-context conditional empirical distribution.

Contribution

It provides the first theoretical proof that a three-layer, single-head transformer can represent in-context distributions for Markov sources, supported by empirical findings.

Findings

01

Transformers with a fixed depth and one head per layer achieve low test loss on $k^{\text{th}}$-order Markov data, even as $k$ grows.

02

A single-head, three-layer transformer can represent in-context distributions.

03

Attention-only transformers with $O(\log k)$ layers can track the previous $k$ symbols.

Abstract

Attention-based transformers have been remarkably successful at modeling generative processes across various domains and modalities. In this paper, we study the behavior of transformers on data drawn from $k^{\text{th}}$-order Markov processes, where the conditional distribution of the next symbol in a sequence depends on the previous $k$ symbols observed. We observe a surprising phenomenon empirically which contradicts previous findings: when trained for sufficiently long, a transformer with a fixed depth and $1$ head per layer is able to achieve low test loss on sequences drawn from $k^{\text{th}}$-order Markov sources, even as $k$ grows. Furthermore, this low test loss is achieved by the transformer's ability to represent and learn the in-context conditional empirical distribution. On the theoretical side, our main result is that a transformer with a single head and three layers can represent the in-context…
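To make the central object of the abstract concrete, the sketch below computes an in-context conditional empirical distribution for a sequence: it looks at the last $k$ symbols, finds every earlier occurrence of that context within the same sequence, and counts which symbol followed. This is an illustration only, not the paper's code. The function name `in_context_conditional`, the unsmoothed counting, and the uniform fallback for unseen contexts are all assumptions made here.

```python
from collections import Counter


def in_context_conditional(seq, k, alphabet):
    """Empirical distribution of the next symbol given the last k symbols,
    estimated only from occurrences of that context inside `seq` itself.

    Hypothetical helper for illustration; assumes k >= 1 and len(seq) >= k.
    """
    if k < 1 or len(seq) < k:
        raise ValueError("need k >= 1 and len(seq) >= k")

    context = tuple(seq[-k:])  # the k most recent symbols
    counts = Counter()

    # Slide a window of length k over the sequence; whenever the window
    # matches the current context, record the symbol that followed it.
    for i in range(k, len(seq)):
        if tuple(seq[i - k:i]) == context:
            counts[seq[i]] += 1

    total = sum(counts.values())
    if total == 0:
        # Assumption: fall back to uniform when the context was never
        # followed by a symbol earlier in the sequence.
        return {a: 1.0 / len(alphabet) for a in alphabet}
    return {a: counts[a] / total for a in alphabet}


seq = [0, 1, 0, 1, 1, 0, 1, 0, 0, 1]
dist_k1 = in_context_conditional(seq, k=1, alphabet=[0, 1])
dist_k2 = in_context_conditional(seq, k=2, alphabet=[0, 1])
print(dist_k1)  # {0: 0.75, 1: 0.25}
print(dist_k2)
```

For `k=1` the context is the final symbol `1`, which was followed by `0` three times and by `1` once earlier in the sequence, giving the estimate shown. A transformer that represents this estimator has to locate matching length-$k$ contexts in its input, which is why tracking the previous $k$ symbols matters.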

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Markov Chains and Monte Carlo Methods