Theoretical Constraints on the Expressive Power of $\mathsf{RoPE}$-based Tensor Attention Transformers

Xiaoyu Li; Yingyu Liang; Zhenmei Shi; Zhao Song; Mingda Wan

arXiv:2412.18040·cs.LG·December 25, 2024

TL;DR

This paper provides a theoretical analysis of Tensor Attention and $\mathsf{RoPE}$-based Tensor Attention, revealing their limitations in solving certain computational problems despite empirical successes, thus guiding future model development.

Contribution

It offers the first circuit complexity analysis of Tensor Attention and $\mathsf{RoPE}$-based models, identifying their fundamental computational constraints.

Findings

01

With polynomial precision, constant-depth layers, and linear or sublinear hidden dimension, Tensor Attention cannot solve fixed membership problems.

02

Under the same conditions, $\mathsf{RoPE}$-based Tensor Attention cannot solve $(A_{F,r})^*$ closure problems.

03

Theoretical constraints are established under the assumption that $\mathsf{TC}^0 \neq \mathsf{NC}^1$.
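
For readers less familiar with the circuit classes named in the third finding, here is a brief sketch of the standard definitions and of how a conditional limitation of this kind is typically argued. The placeholder symbols $\mathcal{M}$ (the model class) and $L$ (the target problem) are ours, and the two premises are an assumption about the shape of the argument, not statements quoted from the paper.

```latex
% Standard circuit classes (polynomial size in both cases):
%   TC^0: constant depth, unbounded fan-in AND/OR/NOT/MAJORITY gates
%   NC^1: O(log n) depth, fan-in-2 AND/OR gates plus NOT
\[
  \mathsf{TC}^0 \subseteq \mathsf{NC}^1,
  \qquad \text{and whether this inclusion is strict is an open question.}
\]
% Assumed shape of the argument (M and L are hypothetical placeholders;
% hardness is taken under reductions computable in TC^0):
\[
  \mathcal{M} \subseteq \mathsf{TC}^0
  \;\text{ and }\;
  L \text{ is } \mathsf{NC}^1\text{-hard}
  \;\Longrightarrow\;
  \bigl( \mathcal{M} \text{ solves } L \;\Rightarrow\; \mathsf{TC}^0 = \mathsf{NC}^1 \bigr).
\]
```

Read contrapositively: if $\mathsf{TC}^0 \neq \mathsf{NC}^1$, no model in $\mathcal{M}$ solves $L$.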

Abstract

Tensor Attention extends traditional attention mechanisms by capturing high-order correlations across multiple modalities, addressing the limitations of classical matrix-based attention. Meanwhile, Rotary Position Embedding ($\mathsf{RoPE}$) has shown superior performance in encoding positional information in long-context scenarios, significantly enhancing transformer models' expressiveness. Despite these empirical successes, the theoretical limitations of these technologies remain underexplored. In this study, we analyze the circuit complexity of Tensor Attention and $\mathsf{RoPE}$-based Tensor Attention, showing that with polynomial precision, constant-depth layers, and linear or sublinear hidden dimension, they cannot solve fixed membership problems or $(A_{F, r})^{*}$ closure problems, under the assumption that $\mathsf{TC}^{0} \neq \mathsf{NC}^{1}$. These findings highlight a gap between…
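
As background for the positional-encoding claim above, the following is a minimal sketch of the commonly used RoPE rotation on a single query/key pair in ordinary (matrix) attention. It is not the paper's tensor-attention construction. The function names `rope_rotate` and `dot`, the frequency base `10000.0`, and the sample vectors are illustrative assumptions. The sketch shows the property that makes RoPE a relative encoding: the score between a rotated query and a rotated key depends only on the difference of their positions.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle position * base**(-2i/d).

    Hypothetical helper for illustration; `base` is the conventional default,
    assumed here rather than taken from the paper.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even dimension")
    out = []
    for i in range(d // 2):
        theta = position * base ** (-2.0 * i / d)
        x, y = vec[2 * i], vec[2 * i + 1]
        # Standard 2-D rotation of the pair (x, y) by theta.
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(a, b):
    """Plain inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

# Arbitrary sample query and key vectors (illustrative values only).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Both position pairs have the same offset (query is 3 steps after the key),
# so the two attention scores should agree up to floating-point error.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 13), rope_rotate(k, 10))
print(abs(score_a - score_b) < 1e-9)
```

Because each pair of coordinates is rotated by an angle proportional to the position, shifting query and key by the same amount cancels in the inner product, which is why the comparison prints `True`.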

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Tensor decomposition and applications · Computational Physics and Python Applications

Methods: Linear Layer · Dense Connections · Residual Connection · Adam · Multi-Head Attention · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization · Dropout · Softmax