TL;DR
This paper investigates the theoretical computational capabilities of Transformers, establishing their Turing-completeness under various configurations, and explores the roles of different components in their power, with experimental validation on translation and synthetic tasks.
Contribution
It provides a simpler alternative proof that vanilla Transformers are Turing-complete, extends the result to Transformers that use only positional masking and no positional encoding, and analyzes which components, such as residual connections, are necessary for Turing-completeness.
Findings
Transformers with only positional masking and no positional encoding are still Turing-complete.
A specific residual connection type is essential for Turing-completeness.
Experiments on machine translation and synthetic tasks illustrate the practical implications of the theoretical findings.
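As a rough intuition for how positional masking can substitute for positional encodings, the sketch below runs a toy single-head self-attention over scalar tokens with no positional information added. It is not the paper's construction: the function names, the identity query/key/value projections, and the use of a leading marker token are all assumptions made for illustration. With a causal mask, a position that attends uniformly over its prefix sees a share of the marker that shrinks as 1/(i+1), so its output depends on where it sits; without the mask, identical tokens receive identical outputs.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, causal):
    """Toy single-head dot-product self-attention over scalar tokens.

    Assumptions (illustrative only): queries, keys, and values are the raw
    token values, and no positional encoding is added anywhere.
    If `causal` is True, position i may only attend to positions 0..i.
    """
    n = len(tokens)
    out = []
    for i in range(n):
        visible = list(range(i + 1)) if causal else list(range(n))
        scores = [tokens[i] * tokens[j] for j in visible]
        weights = softmax(scores)
        out.append(sum(w * tokens[j] for w, j in zip(weights, visible)))
    return out


# A marker token (1.0) followed by identical tokens (0.0).
sequence = [1.0, 0.0, 0.0]

masked = self_attention(sequence, causal=True)
unmasked = self_attention(sequence, causal=False)

# With masking, the zero tokens attend uniformly over their prefix, so
# positions 1 and 2 receive 1/2 and 1/3: the output reveals the position.
print("masked:  ", masked)

# Without masking, positions 1 and 2 see the same set of tokens and
# therefore produce the same output: their order is invisible.
print("unmasked:", unmasked)
```

The zero-valued tokens give all-zero attention scores, which is what makes the attention uniform here; a real model would have to learn projections that produce a comparable effect.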
Abstract
Transformers are being used extensively across several sequence modeling tasks. Significant research effort has been devoted to experimentally probe the inner workings of Transformers. However, our conceptual and theoretical understanding of their power and inherent limitations is still nascent. In particular, the roles of various components in Transformers such as positional encodings, attention heads, residual connections, and feedforward networks, are not clear. In this paper, we take a step towards answering these questions. We analyze the computational power as captured by Turing-completeness. We first provide an alternate and simpler proof to show that vanilla Transformers are Turing-complete and then we prove that Transformers with only positional masking and without any positional encoding are also Turing-complete. We further analyze the necessity of each component for the…
Taxonomy
Methods: Residual Connection
