How to Capture Higher-order Correlations? Generalizing Matrix Softmax Attention to Kronecker Computation
Josh Alman, Zhao Song

TL;DR
This paper generalizes the transformer attention mechanism to capture higher-order correlations using Kronecker computation, providing efficient algorithms under bounded entry conditions and establishing complexity lower bounds.
Contribution
It introduces a novel higher-order attention generalization, offers near-linear time algorithms for bounded entries, and proves lower bounds based on entry magnitude and tensor order.
Findings
Efficient approximation of tensor attention for bounded entries.
Lower bounds for computation time based on entry size and tensor order.
Generalization to higher-order tensors with tradeoffs between expressiveness and efficiency.
Abstract
In the classical transformer attention scheme, we are given three $n \times d$ size matrices $Q, K, V$ (the query, key, and value tokens), and the goal is to compute a new $n \times d$ size matrix $D^{-1} \exp(QK^\top) V$ where $D = \mathrm{diag}(\exp(QK^\top) \mathbf{1}_n)$. In this work, we study a generalization of attention which captures triple-wise correlations. This generalization is able to solve problems about detecting triple-wise connections that were shown to be impossible for transformers. The potential downside of this generalization is that it appears as though computations are even more difficult, since the straightforward algorithm requires cubic time in $n$. However, we show that in the bounded-entry setting (which arises in practice, and which is well-studied in both theory and practice), there is actually a near-linear time algorithm. More precisely, we show that…
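To make the objects in the abstract concrete, here is a minimal NumPy sketch. The first function is standard softmax attention on n × d query, key, and value matrices, with entrywise exp and row normalization. The second is only an assumed form of triple-wise attention, since the abstract does not spell out the definition: each query is scored against every pair of keys, and the matching pair of value rows is multiplied entrywise. The names `triple_attention_naive`, `K1`, `K2`, `V1`, `V2` are hypothetical. It is evaluated directly in time cubic in n, and is not the near-linear algorithm the paper describes.

```python
import numpy as np


def softmax_attention(Q, K, V):
    """Classical attention: normalize the rows of exp(Q K^T), then multiply by V."""
    A = np.exp(Q @ K.T)                  # (n, n) unnormalized attention scores
    D = A.sum(axis=1, keepdims=True)     # (n, 1) row sums (the diagonal normalizer)
    return (A / D) @ V                   # (n, d)


def triple_attention_naive(Q, K1, K2, V1, V2):
    """ASSUMED triple-wise attention, computed directly in O(n^3 * d) time.

    Hypothetical form (not taken from the abstract):
      scores[i, j, k] = sum_l Q[i, l] * K1[j, l] * K2[k, l]
      out[i, :]       = sum_{j,k} softmax_{(j,k)}(scores[i])[j, k] * V1[j, :] * V2[k, :]
    """
    scores = np.einsum("il,jl,kl->ijk", Q, K1, K2)   # (n, n, n)
    A = np.exp(scores)
    D = A.sum(axis=(1, 2))                           # (n,) normalizer per query
    out = np.einsum("ijk,jl,kl->il", A, V1, V2)      # (n, d)
    return out / D[:, None]


# Small example with bounded entries (all values in [-1, 1]).
rng = np.random.default_rng(0)
n, d = 6, 4
Q, K1, K2, V1, V2 = (rng.uniform(-1.0, 1.0, size=(n, d)) for _ in range(5))
pairwise = softmax_attention(Q, K1, V1)
triple = triple_attention_naive(Q, K1, K2, V1, V2)
```

Under this assumed form, setting `K2` and `V2` to all-ones matrices makes the score independent of the third index, so the result reduces to classical attention on `(Q, K1, V1)`. That gives a quick sanity check on the sketch.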
Taxonomy
Topics: Graph Theory and Algorithms · Stochastic Gradient Optimization Techniques · Tensor decomposition and applications
