Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory

Xueyan Niu; Bo Bai; Lei Deng; Wei Han

arXiv:2405.08707 · cs.LG · December 2, 2024

TL;DR

This paper presents a theoretical framework, based on associative memories and Hopfield networks, that explains why increasing transformer size does not always improve performance and links model performance to memorization of the training samples.

Contribution

It introduces a novel energy-based model of transformers with associative memory, providing insights into memorization and performance limits beyond empirical scaling laws.

Findings

1. Transformer performance is linked to memorization of training samples.

2. A global energy function captures the layered architecture of transformers.

3. There is a size-dependent optimal performance bound related to dataset size.

Abstract

Increasing the size of a Transformer does not always lead to enhanced performance. This phenomenon cannot be explained by the empirical scaling laws. Furthermore, the model's enhanced performance is closely associated with its memorization of the training samples. We present a theoretical framework that sheds light on the memorization during pre-training of transformer-based language models. We model the behavior of Transformers with associative memories using Hopfield networks, such that each transformer block effectively conducts an approximate nearest-neighbor search. In particular, the energy function in modern continuous Hopfield networks serves as an explanation for the attention mechanism, which we approximate with a distance-based energy function. By observing that the softmax function corresponds to the gradient of the LogSumExp function in the energy, and employing the…
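Two ideas in the abstract can be checked numerically: that softmax is the gradient of the LogSumExp function, and that a softmax-weighted read over stored patterns behaves like an approximate nearest-neighbor search. The sketch below is a minimal illustration in plain Python, not the authors' model. The function names, the inverse temperature `beta`, and the use of negative half squared Euclidean distance as the score are assumptions made here for illustration; the excerpt does not give the exact form of the paper's distance-based energy.

```python
import math


def logsumexp(z, beta=1.0):
    """Scaled LogSumExp, (1/beta) * log(sum_i exp(beta * z_i)), for beta > 0."""
    m = max(z)  # subtract the maximum for numerical stability
    return m + math.log(sum(math.exp(beta * (v - m)) for v in z)) / beta


def softmax(z, beta=1.0):
    """Softmax of beta * z, computed stably."""
    m = max(z)
    w = [math.exp(beta * (v - m)) for v in z]
    total = sum(w)
    return [v / total for v in w]


def numerical_gradient(f, z, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at z."""
    grad = []
    for i in range(len(z)):
        up = list(z)
        down = list(z)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def retrieve(patterns, query, beta=1.0):
    """One softmax-weighted read over stored patterns.

    Assumption: each pattern is scored by -0.5 * ||pattern - query||^2.
    For large beta the weights concentrate on the closest pattern, so the
    output approaches the nearest stored pattern.
    """
    scores = [
        -0.5 * sum((p_k - q_k) ** 2 for p_k, q_k in zip(p, query))
        for p in patterns
    ]
    weights = softmax(scores, beta)
    return [
        sum(w * p[k] for w, p in zip(weights, patterns))
        for k in range(len(query))
    ]


if __name__ == "__main__":
    z = [0.5, -1.0, 2.0]
    # The two lists printed below should agree up to finite-difference error.
    print(numerical_gradient(lambda v: logsumexp(v, beta=2.0), z))
    print(softmax(z, beta=2.0))

    stored = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    # A noisy query near the first stored pattern, read at high beta.
    print(retrieve(stored, [0.9, 0.2], beta=50.0))
```

At small `beta` the read returns a blend of the stored patterns, and at large `beta` it snaps to the closest one, which is the sense in which such a read approximates a nearest-neighbor lookup.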

Taxonomy

Topics: Theatre and Performance Studies

Methods: Linear Layer · Discriminative Fine-Tuning · Multi-Head Attention · Dense Connections · Attention Dropout · Position-Wise Feed-Forward Layer · Weight Decay · Cosine Annealing · Dropout