Memorization Capacity of Multi-Head Attention in Transformers
Sadegh Mahdavi, Renjie Liao, Christos Thrampoulidis

TL;DR
This paper analyzes the memorization capacity of multi-head attention in transformers, revealing how the number of heads and sequence length influence their ability to memorize data, supported by theoretical analysis and experiments.
Contribution
It introduces new assumptions on the linear independence of input data, distinct from the commonly used general-position assumption, and provides a theoretical framework for understanding how attention heads memorize data, with validation on synthetic datasets.
Findings
An attention layer with H heads and context size n can memorize Ω(Hn) example sequences.
The number of parameters scales as Θ(Hd^2), where d is the dimension.
Different heads specialize in different sequences.
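The scaling in the findings above can be checked with simple arithmetic. The sketch below is an illustration, not the paper's exact parameterization: it assumes each head owns full d x d query, key, and value matrices plus a d x d_out slice of an output projection, and it ignores biases. The function names `mha_param_count` and `capacity_scale` are hypothetical.

```python
from typing import Optional


def mha_param_count(num_heads: int, d: int, d_out: Optional[int] = None) -> int:
    """Count weights in a hypothetical multi-head attention layer.

    Assumption (not taken from the paper): every head has its own
    d x d query, key, and value matrices and a d x d_out output slice.
    Biases are ignored.
    """
    if d_out is None:
        d_out = d
    per_head = 3 * d * d + d * d_out  # W_Q, W_K, W_V, and the output slice
    return num_heads * per_head


def capacity_scale(num_heads: int, n: int) -> int:
    """Return H * n, the quantity inside the Omega(Hn) bound.

    The hidden constant of the bound is not stated in this summary,
    so this is only the growth term, not an actual example count.
    """
    return num_heads * n


if __name__ == "__main__":
    d, n = 64, 32
    for num_heads in (1, 2, 4, 8):
        params = mha_param_count(num_heads, d)
        print(num_heads, params, params / (num_heads * d * d), capacity_scale(num_heads, n))
```

Under these assumptions the ratio of parameters to H * d^2 stays constant as H grows, which is what Θ(Hd^2) expresses, while the H * n term grows linearly in the number of heads.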
Abstract
Transformers have become the go-to architecture for language and vision tasks, yet their theoretical properties, especially memorization capacity, remain elusive. This paper investigates the memorization abilities of multi-head attention mechanisms, examining how many example sequences they can memorize, as a function of the number of heads and sequence length. Motivated by experimental findings on vision transformers, we introduce novel assumptions about the linear independence of input data, distinct from the commonly used general-position assumption. Under these assumptions, we demonstrate that an attention layer with H heads, dimension d, and context size n < d, featuring Θ(Hd^2) parameters, can memorize Ω(Hn) examples. Our analysis sheds light on how different attention heads handle various example sequences, aided by the softmax operator's saturation property.…
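The saturation property mentioned in the abstract can be illustrated numerically. The following is a minimal sketch, assuming NumPy is available; the `scale` argument is a stand-in for the magnitude of the attention logits and is not a parameter named in the paper. It shows only the general behavior of softmax, not the paper's construction or proof.

```python
import numpy as np


def softmax(scores, scale=1.0):
    """Numerically stable softmax of scale * scores."""
    z = scale * np.asarray(scores, dtype=float)
    z = z - z.max()  # subtract the max so exp() cannot overflow
    e = np.exp(z)
    return e / e.sum()


if __name__ == "__main__":
    # Hypothetical attention logits for a three-token context.
    scores = [1.0, 2.0, 0.5]
    for scale in (1.0, 10.0, 100.0):
        # As the scale grows, the weights concentrate on the largest logit.
        print(scale, np.round(softmax(scores, scale), 4))
```

As the logits grow in magnitude, the attention weights approach a one-hot vector on the highest-scoring token, so a saturated head effectively selects a single token from its context.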
Peer Reviews
Decision: ICLR 2024 spotlight
- The paper makes theoretical contributions by exploring the memorization capacity of transformers, an area that is not yet fully understood. This contributes to a deeper understanding of transformer architectures.
- The paper introduces new assumptions about the linear independence of input data, distinct from commonly used assumptions. This novel approach provides a fresh perspective on analyzing transformer models.
- The findings are validated through experiments on synthetic data. This empi
- Limited Empirical Testing: While the paper includes synthetic experiments, real-world data experiments might be needed to fully understand the practical implications of the findings.
- Focus on Single-Layer MHA Module: The study primarily focuses on a single-layer Multi-head Attention (MHA) module. Expanding the analysis to multi-layered architectures could provide more comprehensive insights.
- Potential for Broader Impact Analysis: The paper could benefit from a more in-depth discussion on h
1. The paper is well-organized and the proof makes sense.
2. The two input-data assumptions are milder than the General Position assumptions. Although it is impossible to fully verify its generalizability, the author demonstrated the reasonableness of the assumptions through sampling testing, which interests me.
3. The conclusion “When fixing d, n, increasing dh only helps up to dh < n, and there is no memorization gain beyond that” is enlightening and I believe it can bring more valuable thinki
1. Image patch tokens (ViT) and language tokens may differ significantly. Can the authors' experimental verification of those assumptions be repeated on NLP tasks?
1. The assumptions in this paper are more relaxed. The authors verified the rationality of the assumptions on real data.
2. The exploration of the memorization capacity of transformers is meaningful for more advanced go-to architectures, while the memorization abilities of attention modules are quite interesting.
3. The paper is well-written.
1. One of my main concerns is the illustration or definition of "memorization" in this paper. The inputs of attention include both the key matrix and the query vector. In a common understanding, attention plays a role in capturing knowledge from the context according to the "attention" on other tokens for each token. So what does attention memorize? I think the paper should make this clearer before or after the theoretical analysis, or even verify the memorized knowledge with some visualization. 2
Taxonomy
Topics: Advanced Memory and Neural Computing · Ferroelectric and Negative Capacitance Devices · Advanced Neural Network Applications
Methods: Attention Is All You Need · Label Smoothing · Adam · Absolute Position Encodings · Residual Connection · Linear Layer · Dropout · Position-Wise Feed-Forward Layer · Layer Normalization · Byte Pair Encoding
