A Provable Expressiveness Hierarchy in Hybrid Linear-Full Attention
Xiaowei Ye, Xiaoyu He, Chao Liao, Chen Wu, Pinyan Lu

TL;DR
This paper provides a rigorous theoretical comparison of the expressive power of hybrid linear-full attention versus full attention in Transformers, establishing a hierarchy and demonstrating a clear separation in their capabilities.
Contribution
It introduces the first provable hierarchy showing the expressive limitations of hybrid attention compared to full attention in transformer models.
Findings
Full attention networks with L+1 layers suffice for sequential function composition, a multi-step reasoning task that must be carried out within a single forward pass.
Hybrid attention networks require exponentially more layers to match full attention.
Together, these results give a formal separation in expressive power between hybrid and full attention mechanisms.
Abstract
Transformers serve as the foundation of most modern large language models. To mitigate the quadratic complexity of standard full attention, various efficient attention mechanisms, such as linear and hybrid attention, have been developed. A fundamental gap remains: their expressive power relative to full attention lacks a rigorous theoretical characterization. In this work, we theoretically characterize the performance differences among these attention mechanisms. Our theory applies to all linear attention variants that can be formulated as a recurrence, including Mamba, DeltaNet, etc. Specifically, we establish an expressiveness hierarchy: for sequential function composition, a multi-step reasoning task that must occur within a model's forward pass, an (L+1)-layer full attention network is sufficient, whereas any hybrid network interleaving layers of full attention with a…
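To make the phrase "formulated as a recurrence" concrete, here is a minimal sketch of the simplest unnormalized causal linear attention, written in pure Python. It is an illustration under stated assumptions, not the paper's construction and not the update rule of Mamba or DeltaNet: the function names, the plain additive state update, and the toy inputs are all hypothetical. The point it tries to show is that the layer carries a fixed-size state from token to token, whereas full attention can look back at every earlier token directly.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def linear_attention_recurrent(
    qs: List[Vector], ks: List[Vector], vs: List[Vector]
) -> List[Vector]:
    """Causal linear attention computed with a fixed-size running state.

    Assumed (illustrative) update: S_t = S_{t-1} + k_t v_t^T, o_t = S_t^T q_t.
    The state is a d_k x d_v matrix whose size does not grow with sequence length.
    """
    d_k, d_v = len(ks[0]), len(vs[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        # Fold the current key/value pair into the state (outer product k v^T).
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * v[j]
        # Read the state out with the current query.
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


def linear_attention_quadratic(
    qs: List[Vector], ks: List[Vector], vs: List[Vector]
) -> List[Vector]:
    """The same function written as an explicit sum over all earlier tokens:
    o_t = sum_{s <= t} (q_t . k_s) v_s, which costs quadratic time in length.
    """
    d_v = len(vs[0])
    outputs = []
    for t, q in enumerate(qs):
        out = [0.0] * d_v
        for s in range(t + 1):
            weight = dot(q, ks[s])
            for j in range(d_v):
                out[j] += weight * vs[s][j]
        outputs.append(out)
    return outputs


# Toy inputs (hypothetical): three tokens, key dimension 2, value dimension 1.
qs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ks = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]]
vs = [[1.0], [2.0], [3.0]]

recurrent_out = linear_attention_recurrent(qs, ks, vs)
quadratic_out = linear_attention_quadratic(qs, ks, vs)
```

Both functions return the same outputs on these inputs. The recurrent form only ever sees the past through its fixed-size state, which is the kind of bottleneck an expressiveness separation between linear-style and full attention would be expected to exploit; the exact argument is the paper's and is not reproduced here.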
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
