Pause Tokens Strictly Increase the Expressivity of Constant-Depth Transformers
Charles London, Varun Kanade

TL;DR
This paper proves that pause tokens strictly increase the computational expressivity of constant-depth Transformers, and shows empirically that they enable small Transformers to learn parity.
Contribution
It provides the first formal proof that pause tokens expand Transformer expressivity, showing their effect on computational classes and empirical learning capabilities.
Findings
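With bounded-precision activations, adding pause tokens lets constant-depth Transformers express all of $\text{AC}^0$ rather than a strict subset; with logarithmic precision, pause tokens give expressivity equivalent to $\text{TC}^0$.
Two-layer causally masked Transformers can learn parity when supplied with pause tokens, and appear unable to learn it without them.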
Theoretical results explain empirical improvements observed with pause tokens.
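For context on why parity is the natural test function here, the block below restates the standard definition of parity and the classical circuit-complexity facts that separate the two classes named above. The notation ($x_i$, $n$, $\mathrm{PARITY}_n$) is ours and is not taken from the paper.

```latex
% Parity of an n-bit input x = (x_1, ..., x_n), with x_i in {0, 1}
\mathrm{PARITY}_n(x) \;=\; x_1 \oplus x_2 \oplus \cdots \oplus x_n
                     \;=\; \Big(\sum_{i=1}^{n} x_i\Big) \bmod 2

% Classical facts: parity separates the two classes
\mathrm{PARITY} \notin \text{AC}^0, \qquad
\mathrm{PARITY} \in \text{TC}^0, \qquad
\text{hence } \text{AC}^0 \subsetneq \text{TC}^0 .
```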
Abstract
Pause tokens, simple filler symbols such as "...", consistently improve Transformer performance on both language and mathematical tasks, yet their theoretical effect remains unexplained. We provide the first formal separation result, proving that adding pause tokens to constant-depth, logarithmic-width Transformers strictly increases their computational expressivity. With bounded-precision activations, Transformers without pause tokens compute only a strict subset of $\text{AC}^0$ functions, while adding a polynomial number of pause tokens allows them to express the entire class. For logarithmic-precision Transformers, we show that adding pause tokens achieves expressivity equivalent to $\text{TC}^0$, matching known upper bounds. Empirically, we demonstrate that two-layer causally masked Transformers can learn parity when supplied with pause tokens, a function that they appear…
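To make the parity setup concrete, here is a minimal sketch of how an input with pause tokens could be constructed. The token ids, the placement of the pause tokens after the input bits, and the helper names `parity` and `make_example` are all assumptions made for illustration; this summary does not specify the paper's actual data format or training setup.

```python
import random

# Hypothetical vocabulary: 0 and 1 are the input bits, 2 is the pause token.
PAUSE = 2

def parity(bits):
    """Return 1 if the number of ones in `bits` is odd, else 0."""
    return sum(bits) % 2

def make_example(n_bits, n_pause, rng):
    """Build one (tokens, label) pair.

    The sequence is n_bits random bits followed by n_pause pause tokens
    (an assumed layout); the label is the parity of the bits only.
    """
    bits = [rng.randint(0, 1) for _ in range(n_bits)]
    tokens = bits + [PAUSE] * n_pause
    return tokens, parity(bits)

rng = random.Random(0)
tokens, label = make_example(n_bits=8, n_pause=16, rng=rng)
```

The pause tokens carry no information about the input; under this layout they only give a causally masked model extra positions at which to compute before producing its answer.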
Taxonomy
Topics: Advanced Memory and Neural Computing · Modular Robots and Swarm Intelligence · Advanced Materials and Mechanics
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing
