On the Expressive Power of Self-Attention Matrices
Valerii Likhosherstov, Krzysztof Choromanski, Adrian Weller

TL;DR
This paper shows theoretically that a self-attention module with fixed weights can approximate arbitrary sparse attention matrices as its input varies, with a hidden size that grows logarithmically with sequence length.
Contribution
The paper proves that a fixed self-attention module can approximate arbitrary sparse matrices by changing only its input, with a hidden size logarithmic in the sequence length. This advances the understanding of self-attention's expressive capabilities.
Findings
Self-attention matrices can approximate sparse matrices.
The required hidden size grows logarithmically with sequence length.
A constructive proof and algorithm for approximation are provided.
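The first two findings can be illustrated with a toy sketch. It is not the paper's construction or algorithm. It is a simplified example that handles only the special case of one nonzero entry per row, and it assumes the fixed module simply reads the first half of each input vector as the query and the second half as the key. All names (sign_code, fixed_attention, inputs_for_targets) are hypothetical. Each position is encoded by the +/-1 digits of its binary index, so the hidden size is about log2 of the sequence length.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def sign_code(index, num_bits):
    # Encode an index as a +/-1 vector of its binary digits (toy assumption).
    return [1.0 if (index >> b) & 1 else -1.0 for b in range(num_bits)]

def fixed_attention(X, d):
    # Fixed module (assumed): query = first d coordinates, key = last d.
    # Returns the row-wise softmax of Q K^T / sqrt(d).
    Q = [row[:d] for row in X]
    K = [row[d:] for row in X]
    return [
        softmax([sum(q * k for q, k in zip(q_i, k_j)) / math.sqrt(d) for k_j in K])
        for q_i in Q
    ]

def inputs_for_targets(targets, scale=20.0):
    # Build inputs so that token i attends (almost) only to targets[i].
    seq_len = len(targets)
    d = max(1, (seq_len - 1).bit_length())  # hidden size ~ log2(seq_len)
    X = []
    for i, t in enumerate(targets):
        query = [scale * c for c in sign_code(t, d)]  # points at the target's code
        key = sign_code(i, d)                         # this token's own code
        X.append(query + key)
    return X, d

targets = [3, 0, 7, 7, 1, 5, 2, 6]  # desired sparse pattern, one target per row
X, hidden = inputs_for_targets(targets)
A = fixed_attention(X, hidden)
```

Here the module never changes. Only X is rebuilt for a different list of targets, and with 8 tokens the query and key each use 3 coordinates. A larger scale pushes each row of A closer to a one-hot vector.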
Abstract
Transformer networks are able to capture patterns in data coming from many domains (text, images, videos, proteins, etc.) with little or no change to architecture components. We perform a theoretical analysis of the core component responsible for signal propagation between elements, i.e. the self-attention matrix. In practice, this matrix typically exhibits two properties: (1) it is sparse, meaning that each token only attends to a small subset of other tokens; and (2) it changes dynamically depending on the input to the module. With these considerations in mind, we ask the following question: Can a fixed self-attention module approximate arbitrary sparse patterns depending on the input? How small is the hidden size required for such approximation? We make progress in answering this question and show that the self-attention matrix can provably approximate sparse matrices, where…
Taxonomy
Topics: Gene expression and cancer classification · Advanced Graph Neural Networks · Machine Learning and Algorithms
