Strassen Attention, Split VC Dimension and Compositionality in Transformers
Alexander Kozachinskiy, Felipe Urrutia, Hector Jimenez, Tomasz Steifer, Germán Pizarro, Matías Fuentes, Francisco Meza, Cristian B. Calderon, Cristóbal Rojas

TL;DR
This paper proves that one-layer softmax transformers cannot solve three advanced reasoning tasks (Match 3, function composition, and binary relations composition) and introduces Strassen attention, a sub-cubic-time mechanism with which a one-layer transformer can in principle solve all three.
Contribution
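The paper gives the first method for proving limitations of one-layer softmax transformers with arbitrarily many (even infinite) precision bits, and introduces Strassen attention, a novel mechanism with which a one-layer transformer can in principle solve the three reasoning tasks studied.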
Findings
Strassen attention enables solving complex reasoning tasks.
It has sub-cubic time complexity, improving scalability.
It outperforms standard attention on all tested tasks.
Abstract
We propose the first method to show theoretical limitations for one-layer softmax transformers with arbitrarily many precision bits (even infinite). We establish those limitations for three tasks that require advanced reasoning. The first task, Match 3 (Sanford et al., 2023), requires looking at all possible token triplets in an input sequence. The second and third tasks address compositionality-based reasoning: function composition (Peng et al., 2024) and binary relations composition, respectively. We formally prove the inability of one-layer softmax transformers to solve any of these tasks. To overcome these limitations, we introduce Strassen attention and prove that, equipped with this mechanism, a one-layer transformer can in principle solve all these tasks. Importantly, we show that it enjoys sub-cubic running-time complexity, making it more scalable than similar previously…
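To make the "all possible token triplets" requirement of Match 3 concrete, here is a minimal brute-force sketch in Python. It assumes the modular-sum formulation commonly attributed to Sanford et al. (2023): position i is labeled 1 when some pair of positions (j, k) gives x_i + x_j + x_k ≡ 0 (mod M). That formulation, the function name `match3`, and the parameter `modulus` are illustrative assumptions, not taken from the abstract above, and the code is a reference definition of the task, not Strassen attention.

```python
from itertools import product


def match3(xs, modulus):
    """Brute-force reference for the (assumed) Match 3 task.

    For each position i, return 1 if there exist positions j and k
    (repeats allowed) with xs[i] + xs[j] + xs[k] == 0 (mod modulus),
    and 0 otherwise. Checking every triplet costs O(n^3) time.
    """
    n = len(xs)
    labels = []
    for i in range(n):
        # Scan every ordered pair (j, k) for this fixed position i.
        found = any(
            (xs[i] + xs[j] + xs[k]) % modulus == 0
            for j, k in product(range(n), repeat=2)
        )
        labels.append(1 if found else 0)
    return labels


if __name__ == "__main__":
    print(match3([1, 2, 7], 10))
    print(match3([1, 2], 10))
```

The cubic loop is the naive cost of inspecting every triplet; the abstract's claim is that Strassen attention handles such tasks with sub-cubic running time.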
Taxonomy
Topics: Neural Networks and Applications · Machine Learning in Materials Science · Manufacturing Process and Optimization
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Residual Connection · Multi-Head Attention · Label Smoothing · Layer Normalization · Position-Wise Feed-Forward Layer · Adam · Softmax
