Saturated Transformers are Constant-Depth Threshold Circuits
William Merrill, Ashish Sabharwal, Noah A. Smith

TL;DR
This paper shows that saturated transformers, which model practical attention patterns more closely than hard-attention transformers, can be simulated by constant-depth threshold circuits when they use floating-point values, placing the formal languages they recognize within TC^0.
Contribution
It shows that saturated transformers with floating-point values can be simulated by constant-depth threshold circuits (an upper bound on their power, not an equivalence), and that they exceed the known limitations of hard-attention transformers in formal language recognition.
Findings
Saturated transformers transcend hard-attention limitations.
With floating-point values, they can be simulated by constant-depth threshold circuits.
The formal languages they recognize therefore lie within the class TC^0.
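As a rough illustration of what a threshold circuit adds over an AND/OR circuit, the sketch below models a single threshold gate in Python. The function names (`threshold_gate`, `majority`, `and_gate`, `or_gate`) are hypothetical and chosen for this example; they are not taken from the paper. The point is only that AND and OR are special cases of a threshold gate, while majority, a standard example of a function computable in TC^0 but not by constant-depth, polynomial-size AND/OR circuits, needs a threshold set in the middle of the input size.

```python
def threshold_gate(bits, k):
    """Return 1 if at least k of the input bits are 1, otherwise 0."""
    return 1 if sum(bits) >= k else 0


def and_gate(bits):
    # AND is a threshold gate whose threshold equals the number of inputs.
    return threshold_gate(bits, len(bits))


def or_gate(bits):
    # OR is a threshold gate with threshold 1.
    return threshold_gate(bits, 1)


def majority(bits):
    # Strict majority: more than half of the inputs must be 1.
    return threshold_gate(bits, len(bits) // 2 + 1)


if __name__ == "__main__":
    print(majority([1, 1, 0]), majority([1, 0, 0]))
```

A constant-depth threshold circuit stacks a fixed number of layers of such gates, independent of the input length, with the number of gates growing polynomially in that length.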
Abstract
Transformers have become a standard neural network architecture for many NLP problems, motivating theoretical analysis of their power in terms of formal languages. Recent work has shown that transformers with hard attention are quite limited in power (Hahn, 2020), as they can be simulated by constant-depth AND/OR circuits (Hao et al., 2021). However, hard attention is a strong assumption, which may complicate the relevance of these results in practice. In this work, we analyze the circuit complexity of transformers with saturated attention: a generalization of hard attention that more closely captures the attention patterns learnable in practical transformers. We first show that saturated transformers transcend the known limitations of hard-attention transformers. We then prove saturated transformers with floating-point values can be simulated by constant-depth threshold circuits, giving…
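The contrast between hard and saturated attention described in the abstract can be sketched for a single attention head with scalar values. This is a minimal sketch under stated assumptions, not the paper's formal construction: hard attention is taken to select one maximal-score position (the leftmost tie-break here is an assumption), and saturated attention is taken to average uniformly over all positions tied for the maximal score. The function names are hypothetical.

```python
def hard_attention(scores, values):
    """Return the value at a single maximal-score position.

    Ties are broken toward the leftmost position (an assumption made
    for this sketch; Python's max returns the first maximal element).
    """
    best = max(range(len(scores)), key=lambda i: scores[i])
    return values[best]


def saturated_attention(scores, values):
    """Return the uniform average of values at all maximal-score positions."""
    top = max(scores)
    tied = [v for s, v in zip(scores, values) if s == top]
    return sum(tied) / len(tied)


if __name__ == "__main__":
    # With all scores tied, saturated attention averages over every position,
    # so the head can read off the fraction of 1s in the input.
    bits = [1, 1, 0, 1]
    uniform_scores = [0, 0, 0, 0]
    print(hard_attention(uniform_scores, bits))
    print(saturated_attention(uniform_scores, bits))
```

When every score is tied, the saturated head outputs the fraction of 1s in the input, and comparing that fraction with 1/2 amounts to a majority test. The hard-attention head sees only one position and cannot count in this way, which gives some intuition for why saturated attention is the more expressive of the two.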
Taxonomy
Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Machine Learning and Data Classification
