Theoretical Constraints on the Expressive Power of $\mathsf{RoPE}$-based Tensor Attention Transformers
Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Mingda Wan

TL;DR
This paper provides a theoretical analysis of Tensor Attention and $\mathsf{RoPE}$-based Tensor Attention, revealing their limitations in solving certain computational problems despite empirical successes, thus guiding future model development.
Contribution
It offers the first circuit complexity analysis of Tensor Attention and $\mathsf{RoPE}$-based models, identifying their fundamental computational constraints.
Findings
Tensor Attention cannot solve fixed membership problems with polynomial precision.
$\mathsf{RoPE}$-based Tensor Attention is limited in solving $(A_{F,r})^*$ closure problems.
Theoretical constraints are established under the assumption that $\mathsf{TC}^0 \neq \mathsf{NC}^1$.
Abstract
Tensor Attention extends traditional attention mechanisms by capturing high-order correlations across multiple modalities, addressing the limitations of classical matrix-based attention. Meanwhile, Rotary Position Embedding ($\mathsf{RoPE}$) has shown superior performance in encoding positional information in long-context scenarios, significantly enhancing transformer models' expressiveness. Despite these empirical successes, the theoretical limitations of these technologies remain underexplored. In this study, we analyze the circuit complexity of Tensor Attention and $\mathsf{RoPE}$-based Tensor Attention, showing that with polynomial precision, constant-depth layers, and linear or sublinear hidden dimension, they cannot solve fixed membership problems or $(A_{F,r})^*$ closure problems, under the assumption that $\mathsf{TC}^0 \neq \mathsf{NC}^1$. These findings highlight a gap between…
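For readers unfamiliar with the positional encoding the abstract refers to, the following is a minimal sketch of the standard rotary embedding applied to an ordinary query/key pair. It is not the paper's tensor attention construction. The function names `rope_rotate` and `dot`, the example vectors, and the frequency base of 10000 are illustrative assumptions. The sketch shows the property that makes rotary embeddings attractive: the score between a rotated query and a rotated key depends only on the difference of their positions.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive coordinate pairs of `vec` by position-dependent angles.

    Hypothetical helper for illustration; `base` is the commonly used
    frequency base, assumed here rather than taken from the paper.
    """
    d = len(vec)
    assert d % 2 == 0, "coordinates are rotated in pairs, so the dimension must be even"
    out = []
    for i in range(d // 2):
        # Each pair gets its own frequency; the angle grows linearly with position.
        theta = position * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    """Plain inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

# Arbitrary example vectors (assumed values, not from the paper).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Both position pairs have the same offset (3), so the scores should agree.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 103), rope_rotate(k, 100))
print(abs(score_a - score_b) < 1e-9)
```

Because each pair of coordinates is rotated by a plane rotation, vector norms are preserved and only relative position enters the score. The paper's tensor setting involves higher-order interactions that this two-vector sketch does not attempt to model.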
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Tensor decomposition and applications · Computational Physics and Python Applications
Methods: Linear Layer · Dense Connections · Residual Connection · Adam · Multi-Head Attention · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization · Dropout · Softmax
