Paying Attention to Facts: Quantifying the Knowledge Capacity of Attention Layers
Liang Ze Wong

TL;DR
This paper uses tensor rank to analyze how many facts attention layers in transformers can memorize, providing bounds on the rank and empirical insights into how design choices affect factual recall.
Contribution
It introduces a tensor-based framework to quantify the knowledge capacity of attention layers and explores how their design influences memorization ability.
Findings
Tensor rank correlates with database size and memorization capacity.
The value-output and query-key weights, and the choice of argmax versus softmax, affect rank and capacity.
The results suggest a way of increasing layer capacity without increasing the number of parameters.
Abstract
In this paper, we investigate the ability of single-layer attention-only transformers (i.e., attention layers) to memorize facts contained in databases from a linear-algebraic perspective. We associate with each database a 3-tensor, propose the rank of this tensor as a measure of the size of the database, and provide bounds on the rank in terms of properties of the database. We also define a 3-tensor corresponding to an attention layer, and empirically demonstrate the relationship between its rank and database rank on a dataset of toy models and random databases. By highlighting the roles played by the value-output and query-key weights, and the effects of argmax and softmax on rank, our results shed light on the 'additive motif' of factual recall in transformers, while also suggesting a way of increasing layer capacity without increasing the number of parameters.
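To make the idea of a database 3-tensor concrete, here is a minimal sketch in plain Python. It rests on assumptions that the abstract does not confirm: facts are taken to be (subject, relation, object) triples, and the tensor is taken to be a 0/1 indicator over those three axes. The paper's actual construction may differ. Because exact tensor rank is hard to compute in general, the sketch reports only two standard bounds: the largest rank among the three matrix unfoldings (a lower bound) and the number of nonzero entries (an upper bound). All function names and the toy facts are hypothetical.

```python
from fractions import Fraction
from itertools import product


def database_tensor(facts, subjects, relations, objects):
    """Indicator 3-tensor (assumed form): T[s][r][o] = 1 iff (s, r, o) is a fact."""
    s_idx = {s: i for i, s in enumerate(subjects)}
    r_idx = {r: i for i, r in enumerate(relations)}
    o_idx = {o: i for i, o in enumerate(objects)}
    T = [[[0] * len(objects) for _ in relations] for _ in subjects]
    for s, r, o in facts:
        T[s_idx[s]][r_idx[r]][o_idx[o]] = 1
    return T


def unfold(T, mode):
    """Flatten T into a matrix whose rows are indexed by axis `mode`."""
    dims = (len(T), len(T[0]), len(T[0][0]))
    a_axis, b_axis = [m for m in range(3) if m != mode]
    rows = []
    for i in range(dims[mode]):
        row = []
        for a, b in product(range(dims[a_axis]), range(dims[b_axis])):
            idx = [0, 0, 0]
            idx[mode], idx[a_axis], idx[b_axis] = i, a, b
            row.append(T[idx[0]][idx[1]][idx[2]])
        rows.append(row)
    return rows


def matrix_rank(rows):
    """Exact matrix rank via Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            if m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[rank])]
        rank += 1
    return rank


def rank_bounds(T):
    """Return (lower, upper) bounds on the tensor rank of a 0/1 tensor T."""
    lower = max(matrix_rank(unfold(T, mode)) for mode in range(3))
    upper = sum(x for plane in T for row in plane for x in row)  # nonzero count
    return lower, upper


# Hypothetical toy database: two relations that always agree on the object.
subjects = ["Paris", "Berlin"]
relations = ["capital_of", "located_in"]
objects = ["France", "Germany"]
facts = [
    ("Paris", "capital_of", "France"),
    ("Berlin", "capital_of", "Germany"),
    ("Paris", "located_in", "France"),
    ("Berlin", "located_in", "Germany"),
]
T = database_tensor(facts, subjects, relations, objects)
print(rank_bounds(T))  # (2, 4)
```

In this toy case the database holds four facts, but the two relations carry the same information, so the tensor can be written as a sum of two rank-one terms and its rank is 2. This illustrates why rank can be a finer measure of database size than a raw fact count.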
Taxonomy
Topics: Big Data and Business Intelligence
Methods: Attention Is All You Need · Softmax
