Parity, Sensitivity, and Transformers
Alexander Kozachinskiy, Tomasz Steifer, Przemysław Wałęga

TL;DR
This paper investigates when transformers can compute PARITY. It establishes that at least two layers are needed and gives a practical four-layer construction that avoids the impractical assumptions of earlier constructions.
Contribution
The paper proves that one-layer transformers cannot compute PARITY and introduces a new four-layer transformer construction that removes impractical assumptions made by earlier constructions.
Findings
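The average sensitivity of a one-layer transformer grows more slowly than that of PARITY, so a one-layer transformer cannot compute PARITY.
At least two layers are necessary for a transformer to compute PARITY.
A four-layer transformer can compute PARITY without the impractical assumptions of earlier constructions.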
Abstract
Understanding what neural architectures can and cannot compute is a central challenge in the theory of AI. One of the fundamental problems in this context is the PARITY task, which asks whether the number of 1s in a binary input sequence is even or odd. PARITY is one of the central tasks studied in the theory of computation, yet it remains surprisingly unclear under which conditions transformers can or cannot solve it. In this paper, we show that the minimal number of layers a transformer needs to compute PARITY is two. In particular, we solve the open problem asking whether a one-layer transformer can compute PARITY. We answer it negatively by showing that the average sensitivity of a one-layer transformer grows slower than that of PARITY. Furthermore, we show a new construction for a transformer that computes PARITY, which improves on the existing constructions by removing a number of…
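To make the two notions in the abstract concrete, here is a minimal Python sketch of PARITY and a brute-force average sensitivity computation. It assumes the standard definition from the analysis of Boolean functions (the expected number of single-bit flips that change the output, over a uniformly random input); the paper's exact definition or normalization may differ. The helper names and the `majority` comparison function are illustrative and do not come from the paper.

```python
from itertools import product


def parity(bits):
    """Return 1 if the number of 1s in `bits` is odd, else 0."""
    return sum(bits) % 2


def majority(bits):
    """Illustrative comparison function (not from the paper): 1 if more than half the bits are 1."""
    return int(sum(bits) > len(bits) / 2)


def average_sensitivity(f, n):
    """Brute-force average sensitivity of f on n-bit inputs.

    Assumed definition: for each of the 2**n inputs, count the positions
    whose flip changes f, then average those counts over all inputs.
    """
    total = 0
    for x in product((0, 1), repeat=n):
        fx = f(x)
        for i in range(n):
            # Flip bit i and check whether the output changes.
            flipped = x[:i] + (1 - x[i],) + x[i + 1:]
            if f(flipped) != fx:
                total += 1
    return total / 2 ** n


if __name__ == "__main__":
    for n in range(1, 8):
        print(n, average_sensitivity(parity, n), average_sensitivity(majority, n))
```

Flipping any single bit always changes PARITY, so its average sensitivity on n bits is exactly n, the largest value possible for an n-bit Boolean function. The enumeration is exponential in n, so this is only meant for small inputs.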
