Tucker Attention: A generalization of approximate attention mechanisms
Timon Klein, Jonas Kusch, Sebastian Sager, Stefan Schnake, Steffen Schotthöfer

TL;DR
Tucker Attention introduces a generalized, parameter-efficient self-attention mechanism that encompasses existing methods like GQA, MLA, and MHA, offering comparable performance with fewer parameters.
Contribution
The paper proposes Tucker Attention, a unified framework that generalizes and simplifies existing low-rank attention methods, improving efficiency and interpretability.
Findings
Tucker Attention requires an order of magnitude fewer parameters than GQA and MLA.
It achieves comparable validation metrics in LLM and ViT tests.
It encompasses GQA, MLA, and MHA as special cases and is compatible with FlashAttention and RoPE.
Abstract
The pursuit of reducing the memory footprint of the self-attention mechanism in multi-head self-attention (MHA) has spawned a rich portfolio of methods, e.g., grouped-query attention (GQA) and multi-head latent attention (MLA). These methods leverage specialized low-rank factorizations across embedding dimensions or attention heads. From the point of view of classical low-rank approximation, these methods are unconventional and raise the questions of which objects they really approximate and how to interpret the low-rank behavior of the resulting representations. To answer these questions, this work proposes a generalized view of the weight objects in the self-attention layer and a factorization strategy, which allows us to construct a parameter-efficient scheme called Tucker Attention. Tucker Attention requires an order of magnitude fewer parameters for comparable validation metrics, compared…
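To make the idea of factorizing across attention heads and embedding dimensions concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes the per-head key weights are stacked into a 3-way tensor of shape (heads, d_model, d_head) and stored in generic Tucker format (a small core plus one factor matrix per mode). All function names, shapes, and ranks are hypothetical and chosen only for illustration. The sketch also shows one way to read GQA-style weight sharing as a special case: the head-mode factor becomes a one-hot group-assignment matrix and the other two factors are identities.

```python
import numpy as np


def tucker_reconstruct(core, u_head, u_model, u_dim):
    """Rebuild a (H, d_model, d_head) tensor from a Tucker core and factors.

    core:    (r_h, r_m, r_d)
    u_head:  (H, r_h)        factor along the head mode
    u_model: (d_model, r_m)  factor along the model-embedding mode
    u_dim:   (d_head, r_d)   factor along the per-head dimension mode
    """
    return np.einsum("abc,ha,mb,dc->hmd", core, u_head, u_model, u_dim)


def tucker_param_count(H, d_model, d_head, r_h, r_m, r_d):
    """Number of stored parameters in Tucker format (core plus three factors)."""
    return r_h * r_m * r_d + H * r_h + d_model * r_m + d_head * r_d


def one_hot_group_factor(H, G):
    """(H, G) one-hot matrix mapping head h to group h // (H // G)."""
    if H % G != 0:
        raise ValueError("number of heads must be divisible by number of groups")
    u = np.zeros((H, G))
    u[np.arange(H), np.arange(H) // (H // G)] = 1.0
    return u


# Hypothetical toy sizes, chosen only for illustration.
H, G, d_model, d_head = 8, 2, 16, 4
rng = np.random.default_rng(0)

# GQA-style sharing: one key-weight matrix per group, reused by every head in it.
group_weights = rng.standard_normal((G, d_model, d_head))
full_weights = tucker_reconstruct(
    group_weights,               # core holds the per-group weights
    one_hot_group_factor(H, G),  # head-mode factor selects the group
    np.eye(d_model),             # no compression along d_model
    np.eye(d_head),              # no compression along d_head
)

# Parameter counts: dense tensor vs. a Tucker format with small ranks.
dense_params = H * d_model * d_head
tucker_params = tucker_param_count(H, d_model, d_head, r_h=2, r_m=4, r_d=2)
```

In this reading, replacing the identity factors with tall, thin matrices would additionally compress the embedding modes; how Tucker Attention actually chooses and trains its factors is not specified in the text above.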
