TL;DR
This paper introduces Kronecker Attention Operators (KAOs), which operate directly on high-order tensor data by assuming the data follow matrix-variate normal distributions. KAOs cut computational cost substantially while remaining competitive with the original attention operators.
Contribution
The paper proposes a novel attention mechanism for high-order data that avoids flattening and leverages matrix-variate normal distributions, leading to substantial computational savings.
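For readers unfamiliar with the distribution named above, the following is the standard textbook relationship between a matrix-variate normal and its flattened form, not a derivation taken from the paper. The symbols $X$, $M$, $\Omega$, $\Psi$, $n$, and $p$ are notation chosen here for illustration. It shows where the Kronecker product in the operator's name plausibly comes from, and why the structured assumption needs far fewer covariance parameters than a generic multivariate normal on the flattened vector.

```latex
% X is an n x p matrix; M is its mean; \Omega (n x n) is the covariance
% among rows and \Psi (p x p) is the covariance among columns.
X \sim \mathcal{MN}_{n \times p}(M, \Omega, \Psi)
\quad\Longleftrightarrow\quad
\operatorname{vec}(X) \sim \mathcal{N}_{np}\bigl(\operatorname{vec}(M),\; \Psi \otimes \Omega\bigr)

% Free covariance parameters:
%   unstructured multivariate normal on vec(X):  np(np + 1) / 2
%   matrix-variate normal:                       n(n + 1) / 2 + p(p + 1) / 2
```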
Findings
KAOs reduce the required computational resources by a factor of hundreds.
Networks with KAOs outperform non-attention models.
KAOs perform competitively with the original attention operators.
Abstract
Attention operators have been applied on both 1-D data like texts and higher-order data such as images and videos. Use of attention operators on high-order data requires flattening of the spatial or spatial-temporal dimensions into a vector, which is assumed to follow a multivariate normal distribution. This not only incurs excessive requirements on computational resources, but also fails to preserve structures in data. In this work, we propose to avoid flattening by assuming the data follow matrix-variate normal distributions. Based on this new view, we develop Kronecker attention operators (KAOs) that operate on high-order tensor data directly. More importantly, the proposed KAOs lead to dramatic reductions in computational resources. Experimental results show that our methods reduce the amount of required computational resources by a factor of hundreds, with larger factors for…
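To make the cost of flattening concrete, here is a minimal NumPy sketch. `flattened_attention` is ordinary dot-product self-attention on a flattened feature map, as the abstract describes. `averaged_kv_attention` is an assumption made for illustration only: it builds keys and values from row and column averages so that the score matrix has `h + w` columns instead of `h * w`. The summary above does not spell out the exact KAO construction, so this should not be read as the authors' implementation. Both functions omit learned projections, and all function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def flattened_attention(x):
    """Self-attention on an (h, w, c) feature map after flattening to (h*w, c)."""
    h, w, c = x.shape
    q = k = v = x.reshape(h * w, c)
    scores = q @ k.T / np.sqrt(c)          # shape (h*w, h*w): quadratic in h*w
    out = softmax(scores) @ v
    return out.reshape(h, w, c), scores.shape


def averaged_kv_attention(x):
    """Illustrative variant (assumption): keys/values from row and column means."""
    h, w, c = x.shape
    q = x.reshape(h * w, c)
    row_means = x.mean(axis=1)             # (h, c): average over the width axis
    col_means = x.mean(axis=0)             # (w, c): average over the height axis
    kv = np.concatenate([row_means, col_means], axis=0)   # (h + w, c)
    scores = q @ kv.T / np.sqrt(c)         # shape (h*w, h + w)
    out = softmax(scores) @ kv
    return out.reshape(h, w, c), scores.shape


rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16, 8))       # toy 16 x 16 feature map, 8 channels

full_out, full_shape = flattened_attention(x)
kv_out, kv_shape = averaged_kv_attention(x)

# Ratio of score-matrix sizes: (h*w)^2 versus (h*w)*(h + w).
ratio = (full_shape[0] * full_shape[1]) / (kv_shape[0] * kv_shape[1])
print(full_shape, kv_shape, ratio)
```

In this toy setting the score matrix shrinks by a factor of `h * w / (h + w)`, so the gap widens as the spatial size grows. That matches the direction of the savings the abstract reports, but the sketch makes no claim about reproducing the reported factors.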
