Strassen Attention, Split VC Dimension and Compositionality in Transformers

Alexander Kozachinskiy; Felipe Urrutia; Hector Jimenez; Tomasz Steifer; Germán Pizarro; Matías Fuentes; Francisco Meza; Cristian B. Calderon; Cristóbal Rojas

arXiv:2501.19215 · cs.LG · September 26, 2025

TL;DR

This paper proves that one-layer softmax transformers cannot solve three reasoning tasks (Match 3, function composition, and binary relation composition), even with unlimited precision. It then introduces Strassen attention, a sub-cubic-time mechanism with which a one-layer transformer can in principle solve all three.

Contribution

The paper introduces Strassen attention, a novel attention mechanism with which a one-layer transformer can in principle solve reasoning tasks that one-layer softmax transformers provably cannot.

Findings

01

Strassen attention lets a one-layer transformer solve, in principle, Match 3, function composition, and binary relation composition.

02

It has sub-cubic time complexity, improving scalability.

03

It outperforms standard attention on all tested tasks.

Abstract

We propose the first method to show theoretical limitations for one-layer softmax transformers with arbitrarily many precision bits (even infinite). We establish those limitations for three tasks that require advanced reasoning. The first task, Match 3 (Sanford et al., 2023), requires looking at all possible token triplets in an input sequence. The second and third tasks address compositionality-based reasoning: function composition (Peng et al., 2024) and binary relations composition, respectively. We formally prove the inability of one-layer softmax Transformers to solve any of these tasks. To overcome these limitations, we introduce Strassen attention and prove that, equipped with this mechanism, a one-layer transformer can in principle solve all these tasks. Importantly, we show that it enjoys sub-cubic running-time complexity, making it more scalable than similar previously…
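To make the Match 3 task concrete, here is a minimal brute-force sketch in Python. It assumes the formulation commonly attributed to Sanford et al. (2023), which the abstract does not spell out: position i is labelled 1 when some pair of positions (j, k) satisfies x_i + x_j + x_k ≡ 0 (mod M). The function name `match3` and the parameter `modulus` are illustrative choices, not taken from the paper, and the code is a reference for the task itself, not for Strassen attention.

```python
from itertools import product


def match3(tokens, modulus):
    """Brute-force Match 3 labels (assumed formulation, see lead-in).

    Output position i is 1 if there exist positions j, k (repeats allowed)
    with tokens[i] + tokens[j] + tokens[k] divisible by `modulus`, else 0.
    """
    n = len(tokens)
    labels = []
    for i in range(n):
        # Enumerate every ordered pair (j, k); together with i this covers
        # all token triplets that involve position i.
        found = any(
            (tokens[i] + tokens[j] + tokens[k]) % modulus == 0
            for j, k in product(range(n), repeat=2)
        )
        labels.append(int(found))
    return labels


if __name__ == "__main__":
    print(match3([1, 2, 3], 6))
    print(match3([1, 1], 5))
```

This enumeration checks on the order of n^3 triplets for a sequence of length n, which is the cubic baseline that the abstract's sub-cubic running-time claim is measured against.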

Videos

Strassen Attention, Split VC Dimension and Compositionality in Transformers · SlidesLive

Taxonomy

Topics: Neural Networks and Applications · Machine Learning in Materials Science · Manufacturing Process and Optimization

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Residual Connection · Multi-Head Attention · Label Smoothing · Layer Normalization · Position-Wise Feed-Forward Layer · Adam · Softmax