Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory
Xueyan Niu, Bo Bai, Lei Deng, Wei Han

TL;DR
This paper presents a theoretical framework using associative memories and Hopfield networks to explain why increasing transformer size doesn't always improve performance, linking memorization to model behavior.
Contribution
It introduces a novel energy-based model of transformers with associative memory, providing insights into memorization and performance limits beyond empirical scaling laws.
Findings
Transformer performance is linked to memorization of training samples.
A global energy function captures the layered architecture of transformers.
There is a size-dependent optimal performance bound related to dataset size.
Abstract
Increasing the size of a Transformer does not always lead to enhanced performance. This phenomenon cannot be explained by the empirical scaling laws. Furthermore, the model's enhanced performance is closely associated with its memorization of the training samples. We present a theoretical framework that sheds light on the memorization during pre-training of transformer-based language models. We model the behavior of Transformers with associative memories using Hopfield networks, such that each transformer block effectively conducts an approximate nearest-neighbor search. In particular, the energy function in modern continuous Hopfield networks serves as an explanation for the attention mechanism, which we approximate with a distance-based energy function. By observing that the softmax function corresponds to the gradient of the LogSumExp function in the energy, and employing the…
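The abstract's two technical claims, that softmax is the gradient of LogSumExp and that a Hopfield-style block acts like an approximate nearest-neighbor search, can be checked numerically. The following is a minimal sketch and not the paper's own construction: it uses the standard modern continuous Hopfield update (a softmax-weighted average of stored patterns) with dot-product similarity, whereas the paper approximates attention with a distance-based energy. The helper names (`logsumexp`, `numeric_grad`, `hopfield_retrieve`), the inverse temperature `beta=8.0`, and the toy patterns are all illustrative assumptions.

```python
import math


def logsumexp(z):
    """Numerically stable log(sum_i exp(z_i))."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def numeric_grad(f, z, eps=1e-6):
    """Central-difference estimate of the gradient of f at z."""
    grad = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + eps] + z[i + 1:]
        down = z[:i] + [z[i] - eps] + z[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def hopfield_retrieve(patterns, query, beta=8.0):
    """One modern-Hopfield update step (assumed form, dot-product scores).

    Returns the softmax-weighted average of the stored patterns. For a
    large beta the weights concentrate on the best-matching pattern, so
    the step behaves like a soft nearest-neighbor lookup.
    """
    scores = [beta * sum(p_d * q_d for p_d, q_d in zip(p, query))
              for p in patterns]
    weights = softmax(scores)
    dim = len(query)
    return [sum(w * p[d] for w, p in zip(weights, patterns))
            for d in range(dim)]


# Claim 1: the gradient of LogSumExp matches softmax.
z = [0.5, -1.0, 2.0]
probs = softmax(z)
grad = numeric_grad(logsumexp, z)
max_gap = max(abs(g - p) for g, p in zip(grad, probs))

# Claim 2: retrieval pulls a noisy query toward the closest stored pattern.
# Toy unit-norm patterns, so dot-product and distance rankings agree.
patterns = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [0.9, 0.2, -0.1]
retrieved = hopfield_retrieve(patterns, query)

print("max |grad LSE - softmax| =", max_gap)
print("retrieved =", retrieved)
```

With these toy inputs the retrieved vector lands close to the first stored pattern, which is the nearest one to the query. Lowering `beta` makes the result a blend of several patterns instead of a near-exact match.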
Taxonomy
Topics: Theatre and Performance Studies
Methods: Linear Layer · Discriminative Fine-Tuning · Multi-Head Attention · Dense Connections · Attention Dropout · Position-Wise Feed-Forward Layer · Weight Decay · Cosine Annealing · Dropout
