Pause Tokens Strictly Increase the Expressivity of Constant-Depth Transformers

Charles London; Varun Kanade

arXiv:2505.21024·cs.LG·May 28, 2025

Open Access

TL;DR

This paper proves that pause tokens strictly increase the computational expressivity of constant-depth Transformers, and shows empirically that two-layer causally masked Transformers can learn parity when supplied with them.

Contribution

It provides the first formal separation result for pause tokens, proving that they enlarge the class of functions that constant-depth, logarithmic-width Transformers can express, and supports the theory with experiments on learning parity.

Findings

01

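With bounded-precision activations, adding a polynomial number of pause tokens lifts constant-depth Transformers from a strict subset of $\text{AC}^0$ to all of $\text{AC}^0$; with logarithmic precision, pause tokens give expressivity equivalent to $\text{TC}^0$.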

02

Two-layer causally masked Transformers can learn parity when supplied with pause tokens, which they appear unable to do without them.

03

The theoretical results help explain the empirical improvements observed with pause tokens.
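
For context, the findings above can be read against a few standard circuit-complexity facts. The block below is background stated from general knowledge, not a result taken from this paper; the notation $\mathrm{PARITY}$ is introduced here only for illustration.

```latex
% Parity of n input bits: 1 when the number of ones is odd.
\[
  \mathrm{PARITY}(x_1, \dots, x_n)
  \;=\; \bigoplus_{i=1}^{n} x_i
  \;=\; \Big( \sum_{i=1}^{n} x_i \Big) \bmod 2
\]
% Classical results: parity is not computable by constant-depth,
% polynomial-size AND/OR/NOT circuits, but it is computable once
% threshold gates are allowed.
\[
  \mathrm{PARITY} \notin \mathrm{AC}^0,
  \qquad
  \mathrm{PARITY} \in \mathrm{TC}^0,
  \qquad
  \mathrm{AC}^0 \subsetneq \mathrm{TC}^0
\]
```

These facts concern what circuits can express; whether a Transformer learns parity in practice is a separate, empirical question.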

Abstract

Pause tokens, simple filler symbols such as "...", consistently improve Transformer performance on both language and mathematical tasks, yet their theoretical effect remains unexplained. We provide the first formal separation result, proving that adding pause tokens to constant-depth, logarithmic-width Transformers strictly increases their computational expressivity. With bounded-precision activations, Transformers without pause tokens compute only a strict subset of $AC^{0}$ functions, while adding a polynomial number of pause tokens allows them to express the entire class. For logarithmic-precision Transformers, we show that adding pause tokens achieves expressivity equivalent to $TC^{0}$, matching known upper bounds. Empirically, we demonstrate that two-layer causally masked Transformers can learn parity when supplied with pause tokens, a function that they appear…
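
To make the parity setup in the abstract concrete, here is a minimal sketch of how an input sequence with pause tokens and a causal attention mask might be constructed. It is an illustration under stated assumptions, not the authors' experimental code: the token id `PAUSE`, the placement of pause tokens after the input bits, the helper names, and the sizes `n=8` and `num_pause=16` are all hypothetical choices made for this example.

```python
import random

# Hypothetical vocabulary: 0 and 1 are input bits, 2 is the pause (filler) token.
PAUSE = 2


def parity(bits):
    """Return 1 if the number of ones in `bits` is odd, else 0."""
    return sum(bits) % 2


def with_pause_tokens(bits, num_pause):
    """Append `num_pause` filler tokens after the input bits (assumed layout)."""
    return list(bits) + [PAUSE] * num_pause


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


def make_example(n, num_pause, rng):
    """Sample n random bits; return (token sequence, parity label)."""
    bits = [rng.randint(0, 1) for _ in range(n)]
    return with_pause_tokens(bits, num_pause), parity(bits)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
tokens, label = make_example(n=8, num_pause=16, rng=rng)
mask = causal_mask(len(tokens))
```

Under causal masking, only later positions can attend to the whole input, so in this sketch the pause positions give the model extra positions, after all input bits have been seen, at which to carry out intermediate computation before the label is read from the final position.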

Taxonomy

Topics: Advanced Memory and Neural Computing · Modular Robots and Swarm Intelligence · Advanced Materials and Mechanics

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing