Transformers, parallel computation, and logarithmic depth
Clayton Sanford, Daniel Hsu, Matus Telgarsky

TL;DR
This paper shows that transformers of logarithmic depth can solve basic computational tasks that several other neural sequence models and sub-quadratic transformer approximations cannot solve efficiently, which identifies parallelism as a key property distinguishing transformers from those models.
Contribution
It establishes a two-way theoretical connection between transformers and parallel computation: a constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of Massively Parallel Computation (MPC).
Findings
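A constant number of self-attention layers and a constant number of MPC communication rounds can efficiently simulate each other.
Logarithmic depth suffices for transformers to solve basic computational tasks that several other neural sequence models and sub-quadratic transformer approximations cannot solve efficiently.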
Parallelism is a key property of transformers that enables their computational efficiency.
Abstract
We show that a constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of Massively Parallel Computation. As a consequence, we show that logarithmic depth is sufficient for transformers to solve basic computational tasks that cannot be efficiently solved by several other neural sequence models and sub-quadratic transformer approximations. We thus establish parallelism as a key distinguishing property of transformers.
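As a rough illustration of why a logarithmic number of parallel rounds can be enough for a task that looks inherently sequential, the sketch below uses pointer doubling, a standard parallel-computing technique. It is not the paper's construction, and the function name pointer_doubling and the example chain are hypothetical. In each round every node replaces its pointer with its pointer's pointer, so the distance covered doubles and a chain of n nodes is resolved in about log2(n) rounds rather than n - 1 sequential steps.

```python
def pointer_doubling(parent):
    """Find the root of every node using synchronous pointer-doubling rounds.

    parent[i] is the node that i points to; a root points to itself.
    Illustrative sketch only, not the construction used in the paper.
    """
    n = len(parent)
    ptr = list(parent)
    # ceil(log2(n)) rounds suffice because any chain has depth at most n - 1.
    rounds = (n - 1).bit_length()
    for _ in range(rounds):
        # One round: every node reads its pointer's pointer "simultaneously".
        # Building a new list keeps the update synchronous across nodes.
        ptr = [ptr[ptr[i]] for i in range(n)]
    return ptr, rounds


# A chain 7 -> 6 -> ... -> 1 -> 0, with node 0 as the root.
chain = [0, 0, 1, 2, 3, 4, 5, 6]
roots, rounds = pointer_doubling(chain)
print(roots, rounds)
```

Here each round loosely plays the role of one communication round (or one layer): all positions update at once from information held at other positions, and the number of rounds grows only logarithmically with the input length.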
Taxonomy
Topics: Computational Geometry and Mesh Generation · Neural Networks and Applications
