Memorization Capacity of Multi-Head Attention in Transformers

Sadegh Mahdavi; Renjie Liao; Christos Thrampoulidis

arXiv:2306.02010 · cs.LG · March 5, 2024 · 1 cites

TL;DR

This paper analyzes the memorization capacity of multi-head attention in transformers, revealing how the number of heads and sequence length influence their ability to memorize data, supported by theoretical analysis and experiments.

Contribution

It introduces new assumptions about input data independence and provides a theoretical framework for understanding how attention heads memorize data, with validation on synthetic datasets.

Findings

01. Attention layers with H heads can memorize Ω(Hn) examples.

02. The number of parameters scales as Θ(Hd^2).

03. Different heads specialize in different sequences.
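
To make the Θ(Hd^2) parameter scaling concrete, the sketch below counts the weights of a multi-head attention layer. The per-head layout (separate query, key, value, and output matrices, no biases, head dimensions defaulting to d) is an assumption made for illustration and is not taken from the paper; the function name `mha_param_count` is hypothetical.

```python
def mha_param_count(num_heads, d, d_head=None, d_value=None):
    """Count the weights in a multi-head attention layer (biases ignored).

    Assumed layout (an illustration, not the paper's exact model):
    each head owns W_Q and W_K of shape (d, d_head),
    W_V of shape (d, d_value), and W_O of shape (d_value, d).
    """
    d_head = d if d_head is None else d_head
    d_value = d if d_value is None else d_value
    per_head = 2 * d * d_head + 2 * d * d_value
    return num_heads * per_head


if __name__ == "__main__":
    base = mha_param_count(num_heads=4, d=64)
    print(base)                                        # 65536 = 4 * (4 * 64^2)
    print(mha_param_count(num_heads=8, d=64) / base)   # 2.0: linear in H
    print(mha_param_count(num_heads=4, d=128) / base)  # 4.0: quadratic in d
```

Under this assumed layout the count is exactly 4Hd^2 when the head dimensions equal d, which is consistent with the Θ(Hd^2) scaling: doubling H doubles the count, and doubling d quadruples it.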

Abstract

Transformers have become the go-to architecture for language and vision tasks, yet their theoretical properties, especially memorization capacity, remain elusive. This paper investigates the memorization abilities of multi-head attention mechanisms, examining how many example sequences they can memorize, as a function of the number of heads and sequence length. Motivated by experimental findings on vision transformers, we introduce novel assumptions about the linear independence of input data, distinct from the commonly used general-position assumption. Under these assumptions, we demonstrate that an attention layer with $H$ heads, dimension $d$, and context size $n < d$, featuring $\Theta(H d^{2})$ parameters, can memorize $\Omega(H n)$ examples. Our analysis sheds light on how different attention heads handle various example sequences, aided by the softmax operator's saturation property.…
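
The saturation property mentioned in the abstract can be illustrated numerically: as attention scores are scaled up, the softmax output approaches a one-hot vector on the largest score. The sketch below is a generic demonstration of that behavior, not the paper's construction; the `scale` parameter and the example scores are assumptions chosen for illustration.

```python
import math


def softmax(scores, scale=1.0):
    """Numerically stable softmax of `scale * scores`."""
    scaled = [scale * s for s in scores]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    scores = [1.0, 2.0, 3.0]  # hypothetical attention scores
    for scale in (1.0, 5.0, 50.0):
        # Larger scales push nearly all the mass onto the top-scoring token.
        print(scale, [round(p, 4) for p in softmax(scores, scale)])
```

With a large scale, a head effectively attends to a single token, which is one way to picture how separate heads could handle separate example sequences.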

Peer Reviews

Decision·ICLR 2024 spotlight

Reviewer 01 · Rating 6: marginally above the acceptance threshold · Confidence 3

Strengths

- The paper makes theoretical contributions by exploring the memorization capacity of transformers, an area that is not yet fully understood. This contributes to a deeper understanding of transformer architectures.
- The paper introduces new assumptions about the linear independence of input data, distinct from commonly used assumptions. This novel approach provides a fresh perspective on analyzing transformer models.
- The findings are validated through experiments on synthetic data. This empi

Weaknesses

- Limited Empirical Testing: While the paper includes synthetic experiments, real-world data experiments might be needed to fully understand the practical implications of the findings.
- Focus on Single-Layer MHA Module: The study primarily focuses on a single-layer Multi-head Attention (MHA) module. Expanding the analysis to multi-layered architectures could provide more comprehensive insights.
- Potential for Broader Impact Analysis: The paper could benefit from a more in-depth discussion on h

Reviewer 02 · Rating 8: accept, good paper · Confidence 3

Strengths

1. The paper is well-organized and the proof makes sense.
2. The two input-data assumptions are milder than the General Position assumptions. Although it is impossible to fully verify its generalizability, the author demonstrated the reasonableness of the assumptions through sampling testing, which interests me.
3. The conclusion “When fixing $d$, $n$, increasing $d_h$ only helps up to $d_h < n$, and there is no memorization gain beyond that” is enlightening and I believe it can bring more valuable thinki

Weaknesses

1. Image patch tokens (ViT) and language tokens may differ significantly. Can the authors' experimental verification of these assumptions be repeated on NLP tasks?

Reviewer 03 · Rating 8: accept, good paper · Confidence 2

Strengths

1. The assumptions in this paper are more relaxed. The authors verified the rationality of the assumptions on real data.
2. The exploration of the memorization capacity of transformers is meaningful for more advanced go-to architectures, and the memorization abilities of attention modules are quite interesting.
3. The paper is well-written.

Weaknesses

1. One of my main concerns is the illustration or definition of "memorization" in this paper. The inputs of attention include both the key matrix and the query vector. In the common understanding, attention captures knowledge from the context according to each token's "attention" on other tokens. So what does attention memorize? I think the paper should make this clearer before or after the theoretical analysis, or even verify the memorized knowledge with some visualization.
2

Code & Models

Repositories

smahdavi4/attention-memorization
PyTorch · Official

Taxonomy

Topics: Advanced Memory and Neural Computing · Ferroelectric and Negative Capacitance Devices · Advanced Neural Network Applications

Methods: Attention Is All You Need · Label Smoothing · Adam · Absolute Position Encodings · Residual Connection · Linear Layer · Dropout · Position-Wise Feed-Forward Layer · Layer Normalization · Byte Pair Encoding