TL;DR
This paper analyzes the expressive power of standard low-precision softmax transformers with Chain-of-Thought (CoT) and shows that they can simulate Turing machines efficiently, narrowing the gap between theoretical models and the architectures used in practice.
Contribution
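The paper analyzes standard transformer decoders with softmax attention and rounded activations and attention weights, with depth and width allowed to grow logarithmically with the context length. It shows that these models can simulate Turing machines using Chain-of-Thought without the unrealistic parameter magnitudes or activation precision that prior approaches would require.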
Findings
Low-precision softmax transformers can simulate Turing machines using Chain-of-Thought, with depth and width growing logarithmically with the context length.
Summarized Chain-of-Thought improves simulation efficiency, scaling logarithmically in space.
Empirical tests on Sudoku show better alignment with learnability than prior high-precision models.
Abstract
Existing expressivity results for transformers typically rely on hardmax attention, high precision, and other architectural modifications that disconnect them from the models used in practice. We bridge this gap by analyzing standard transformer decoders with softmax attention and rounding of activations and attention weights, while allowing depth and width to grow logarithmically with the context length. As an intermediate step, we construct hardmax transformers with ternary activations and well-separated attention scores that simulate Turing machines using Chain-of-Thought (CoT). This lets us convert the constructions to equivalent softmax transformers without the unrealistic parameter magnitudes or activation precision that prior approaches would require. Using the same technique, we analyze a recently proposed summarized CoT paradigm and show that it simulates Turing machines more…
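The step from hardmax to softmax attention in the abstract relies on well-separated attention scores combined with rounding of attention weights. The following is a minimal Python sketch of that intuition only, not the paper's construction: the helper names (`rounded_softmax`, `sufficient_gap`), the fixed-point rounding scheme, and the gap bound are all assumptions chosen for illustration. It shows that when the top score leads every other score by roughly ln(n) + (bits + 1) ln 2, softmax weights rounded to `bits` fractional bits coincide exactly with hardmax.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def round_to_bits(x, bits):
    """Round x to the nearest multiple of 2**-bits (illustrative fixed-point rounding)."""
    scale = 2 ** bits
    return round(x * scale) / scale


def rounded_softmax(scores, bits):
    """Softmax followed by rounding of every attention weight."""
    return [round_to_bits(w, bits) for w in softmax(scores)]


def hardmax(scores):
    """One-hot weights on the position with the largest score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1.0 if i == best else 0.0 for i in range(len(scores))]


def sufficient_gap(n, bits):
    """Assumed sufficient score gap: the total non-max softmax mass is at most
    (n - 1) * exp(-gap), which this gap pushes below half a rounding step."""
    return math.log(n) + (bits + 1) * math.log(2)


if __name__ == "__main__":
    n, bits = 1000, 8
    scores = [0.0] * n
    scores[3] = sufficient_gap(n, bits)  # well-separated top score
    print(rounded_softmax(scores, bits) == hardmax(scores))

    scores[3] = 1.0  # poorly separated top score
    print(rounded_softmax(scores, bits) == hardmax(scores))
```

In this sketch the required gap grows only logarithmically with the context length n, so the scores that make rounded softmax act like hardmax stay moderate in size. How the paper actually achieves the separation is not stated in the abstract.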
