How to Capture Higher-order Correlations? Generalizing Matrix Softmax Attention to Kronecker Computation

Josh Alman; Zhao Song

arXiv:2310.04064 · cs.DS · October 9, 2023 · 1 cites

TL;DR

This paper generalizes the transformer attention mechanism to capture higher-order correlations using Kronecker computation, providing efficient algorithms under bounded entry conditions and establishing complexity lower bounds.

Contribution

It introduces a novel higher-order attention generalization, offers near-linear time algorithms for bounded entries, and proves lower bounds based on entry magnitude and tensor order.

Findings

1. Efficient approximation of tensor attention for bounded entries.

2. Lower bounds for computation time based on entry size and tensor order.

3. Generalization to higher-order tensors with tradeoffs between expressiveness and efficiency.

Abstract

In the classical transformer attention scheme, we are given three $n \times d$ size matrices $Q, K, V$ (the query, key, and value tokens), and the goal is to compute a new $n \times d$ size matrix $D^{-1} \exp(Q K^{\top}) V$ where $D = \mathrm{diag}(\exp(Q K^{\top}) \mathbf{1}_{n})$. In this work, we study a generalization of attention which captures triple-wise correlations. This generalization is able to solve problems about detecting triple-wise connections that were shown to be impossible for transformers. The potential downside of this generalization is that it appears as though computations are even more difficult, since the straightforward algorithm requires cubic time in $n$. However, we show that in the bounded-entry setting (which arises in practice, and which is well-studied in both theory and practice), there is actually a near-linear time algorithm. More precisely, we show that…
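To make the two computations in the abstract concrete, here is a minimal pure-Python sketch. The function `softmax_attention` follows the formula stated above directly. The function `triple_attention` is an assumption: the abstract does not spell out the triple-wise formula, so the sketch assumes one natural form with two key matrices `K1`, `K2` and two value matrices `V1`, `V2`, where the score of query `i` against the pair `(j, k)` is the sum over coordinates of `Q[i][c] * K1[j][c] * K2[k][c]`. All function and parameter names are illustrative, and this is the straightforward cubic-time baseline, not the near-linear algorithm the paper describes.

```python
import math


def softmax_attention(Q, K, V):
    """Compute D^{-1} exp(Q K^T) V for n x d lists of lists in O(n^2 d) time."""
    n, d = len(Q), len(Q[0])
    out = []
    for i in range(n):
        # Row i of exp(Q K^T): one unnormalized weight per key j.
        weights = [
            math.exp(sum(Q[i][c] * K[j][c] for c in range(d))) for j in range(n)
        ]
        # Diagonal entry D[i][i] = (exp(Q K^T) 1_n)_i.
        norm = sum(weights)
        out.append(
            [sum(weights[j] * V[j][c] for j in range(n)) / norm for c in range(d)]
        )
    return out


def triple_attention(Q, K1, K2, V1, V2):
    """Naive O(n^3 d) triple-wise attention (assumed form, see the lead-in)."""
    n, d = len(Q), len(Q[0])
    out = []
    for i in range(n):
        acc = [0.0] * d
        norm = 0.0
        # Every query i attends to every ordered pair (j, k): n^2 pairs per row.
        for j in range(n):
            for k in range(n):
                score = sum(Q[i][c] * K1[j][c] * K2[k][c] for c in range(d))
                w = math.exp(score)
                norm += w
                for c in range(d):
                    # Assumed value for the pair: entrywise product of the two rows.
                    acc[c] += w * V1[j][c] * V2[k][c]
        out.append([a / norm for a in acc])
    return out
```

The triple loop over `(i, j, k)` is where the cubic dependence on $n$ mentioned in the abstract comes from: each of the $n$ output rows normalizes over $n^2$ pairs instead of $n$ keys.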

Taxonomy

Topics: Graph Theory and Algorithms · Stochastic Gradient Optimization Techniques · Tensor decomposition and applications