Transformers, parallel computation, and logarithmic depth

Clayton Sanford; Daniel Hsu; Matus Telgarsky

arXiv:2402.09268 · cs.LG · February 15, 2024 · 2 cites

TL;DR

This paper shows that logarithmic-depth transformers can solve basic computational tasks that several other neural sequence models and sub-quadratic transformer approximations cannot solve efficiently, and identifies parallelism as a key property distinguishing transformers from those models.

Contribution

It establishes a theoretical connection between transformers and parallel computation: a constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of Massively Parallel Computation (MPC).

Findings

01

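Self-attention layers and Massively Parallel Computation communication rounds can efficiently simulate each other, with a constant number of layers corresponding to a constant number of rounds.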

02

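Logarithmic depth is sufficient for transformers to solve basic computational tasks that cannot be efficiently solved by several other neural sequence models or by sub-quadratic transformer approximations.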

03

Parallelism is a key property that distinguishes transformers from other neural sequence models.

Abstract

We show that a constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of Massively Parallel Computation. As a consequence, we show that logarithmic depth is sufficient for transformers to solve basic computational tasks that cannot be efficiently solved by several other neural sequence models and sub-quadratic transformer approximations. We thus establish parallelism as a key distinguishing property of transformers.
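As a rough intuition for why a logarithmic number of parallel rounds can be enough for some sequential-looking tasks, the sketch below uses the classic pointer-doubling idea: composing a map with itself k times by repeated squaring. This is only an illustration under our own assumptions, not the paper's construction; the function names `compose`, `k_fold_by_doubling`, and `k_fold_sequential` are hypothetical, and each call to `compose` stands in for one round in which every position is updated independently.

```python
def compose(f, g):
    """Return h with h[i] = f[g[i]].

    Every entry is computed independently of the others, so this models
    one parallel round (an assumption of this sketch).
    """
    return [f[j] for j in g]


def k_fold_by_doubling(f, k):
    """Compute the k-fold composition of f by repeated squaring.

    f is a list mapping index -> index. Returns (f^k, squaring_rounds),
    where squaring_rounds counts how often the running power was squared.
    """
    result = list(range(len(f)))  # identity map, i.e. f^0
    power = list(f)               # f^(2^0)
    rounds = 0
    while k > 0:
        if k & 1:
            # Fold the current power of f into the answer.
            result = compose(power, result)
        k >>= 1
        if k > 0:
            # Double the exponent: f^(2^r) -> f^(2^(r+1)).
            power = compose(power, power)
            rounds += 1
    return result, rounds


def k_fold_sequential(f, k):
    """Reference version: apply f one step at a time, k rounds in total."""
    result = list(range(len(f)))
    for _ in range(k):
        result = [f[j] for j in result]
    return result


if __name__ == "__main__":
    f = [3, 0, 4, 1, 1, 2]
    fast, rounds = k_fold_by_doubling(f, 8)
    print(fast == k_fold_sequential(f, 8), rounds)
```

In this sketch the one-step-at-a-time version needs k rounds, while the doubling version squares the map only floor(log2 k) times for k ≥ 1.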

Peer Reviews

No public reviews on file for this paper yet.

Code & Models

Repositories

chsanford/hop-induction-heads
PyTorch · Official

Videos

Transformers, parallel computation, and logarithmic depth · YouTube

Taxonomy

Topics: Computational Geometry and Mesh Generation · Neural Networks and Applications