On the Benefits of Rank in Attention Layers

Noah Amsel, Gilad Yehudai, and Joan Bruna

arXiv:2407.16153·cs.LG·July 24, 2024

TL;DR

This paper investigates the trade-off between the rank of the attention matrices and the number of heads in attention mechanisms. It shows that, for a certain target function, low-rank attention needs a number of heads exponential in the embedding dimension, and that added depth helps for short contexts.

Contribution

The paper provides theoretical insight into the importance of full-rank attention and the limitations of low-rank approximations, supported by empirical validation.

Findings

01

A single full-rank attention head can represent the target function for any context length.

02

Low-rank attention cannot approximate the same target unless the number of heads is exponential in the embedding dimension, even for short contexts.

03

Depth can enable low-rank attention to approximate targets for short contexts.

Abstract

Attention-based mechanisms are widely used in machine learning, most prominently in transformers. However, hyperparameters such as the rank of the attention matrices and the number of heads are scaled nearly the same way in all realizations of this architecture, without theoretical justification. In this work we show that there are dramatic trade-offs between the rank and number of heads of the attention mechanism. Specifically, we present a simple and natural target function that can be represented using a single full-rank attention head for any context length, but that cannot be approximated by low-rank attention unless the number of heads is exponential in the embedding dimension, even for short context lengths. Moreover, we prove that, for short context lengths, adding depth allows the target to be approximated by low-rank attention. For long contexts, we conjecture that full-rank…
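To make the notion of "rank" in the abstract concrete, here is a minimal sketch of a single softmax attention head in plain Python. It is an illustration under stated assumptions, not the authors' code or their target function: the names `attention_head`, `matrix_rank`, `w_q`, and `w_k`, the identity value map, and the toy data are all hypothetical. The only fact it relies on is the standard one that a head whose query and key maps are r x d matrices has a d x d query-key matrix of rank at most r.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(query, context, w_q, w_k):
    """One attention head with an identity value map (an assumption).

    w_q and w_k are r x d matrices, so the combined d x d
    query-key matrix w_q^T w_k has rank at most r.
    """
    q = [sum(w * x for w, x in zip(row, query)) for row in w_q]
    scores = []
    for x in context:
        k = [sum(w * xi for w, xi in zip(row, x)) for row in w_k]
        scores.append(sum(a * b for a, b in zip(q, k)))
    weights = softmax(scores)
    d = len(query)
    return [sum(w * x[i] for w, x in zip(weights, context)) for i in range(d)]


def matrix_rank(m, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in m]
    rows, cols = len(m), len(m[0])
    rank = 0
    for c in range(cols):
        if rank == rows:
            break
        pivot = max(range(rank, rows), key=lambda i: abs(m[i][c]))
        if abs(m[pivot][c]) < tol:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for i in range(rank + 1, rows):
            f = m[i][c] / m[rank][c]
            for j in range(c, cols):
                m[i][j] -= f * m[rank][j]
        rank += 1
    return rank


random.seed(0)
d, r = 4, 1  # embedding dimension and head rank (toy values)

# Low-rank head: random r x d query and key maps.
w_q_low = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]
w_k_low = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]
m_low = matmul(transpose(w_q_low), w_k_low)

# Full-rank head: scaled identity for both maps.
scale = 5.0
w_full = [[scale if i == j else 0.0 for j in range(d)] for i in range(d)]
m_full = matmul(transpose(w_full), w_full)

context = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
query = [0.1, 0.9, 0.2, 0.0]

out_low = attention_head(query, context, w_q_low, w_k_low)
out_full = attention_head(query, context, w_full, w_full)
```

In this toy setup the full-rank head scores each context vector by a scaled inner product with the query, so its softmax weights concentrate on the best-matching vector. The rank-1 head can only compare the query and the context along a single direction each, which is the kind of restriction the abstract's trade-off between rank and number of heads is about.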

Code & Models

Repositories

NoahAmsel/attention-formers
PyTorch · Official

Taxonomy

Methods: Softmax · Attention Is All You Need