Scaling Laws for Associative Memories
Vivien Cabannes, Elvis Dohmatob, Alberto Bietti

TL;DR
This paper investigates the scaling laws of associative memory mechanisms in high-dimensional models, deriving theoretical relationships and validating them through extensive experiments, with implications for understanding transformer inner layers.
Contribution
It introduces precise scaling laws for associative memories based on high-dimensional matrices and analyzes the statistical efficiency of various estimators, including optimization algorithms.
Findings
Derived explicit scaling laws relating sample size and parameter size.
Validated theoretical predictions with extensive numerical experiments.
Provided visualizations of memory associations in high-dimensional models.
Abstract
Learning arguably involves the discovery and memorization of abstract rules. The aim of this paper is to study associative memory mechanisms. Our model is based on high-dimensional matrices consisting of outer products of embeddings, which relates to the inner layers of transformer language models. We derive precise scaling laws with respect to sample size and parameter size, and discuss the statistical efficiency of different estimators, including optimization-based algorithms. We provide extensive numerical experiments to validate and interpret theoretical results, including fine-grained visualizations of the stored memory associations.
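The abstract's "high-dimensional matrices consisting of outer products of embeddings" can be illustrated with a small sketch. This is not the authors' code: it assumes random Gaussian embeddings, a deterministic input-to-output map `f`, and arg-max retrieval, and the names `build_memory`, `recall`, `E`, `U`, `N`, `M` and `d` are hypothetical choices made for the illustration.

```python
import numpy as np

def build_memory(E, U, f):
    """Store the pairs (x, f[x]) as a sum of outer products u_{f(x)} e_x^T.

    E: (N, d) input embeddings, U: (M, d) output embeddings,
    f: (N,) integer array mapping each input index to an output index.
    """
    # (d, N) @ (N, d) sums the N outer products into one (d, d) matrix.
    return U[f].T @ E

def recall(W, E, U):
    """Retrieve, for every input x, the output y maximizing u_y^T W e_x."""
    scores = E @ W.T @ U.T  # scores[x, y] = u_y^T W e_x
    return scores.argmax(axis=1)

# Illustrative sizes (assumptions): few associations, large dimension.
rng = np.random.default_rng(0)
N, M, d = 20, 5, 1024
E = rng.standard_normal((N, d)) / np.sqrt(d)  # roughly unit-norm rows
U = rng.standard_normal((M, d)) / np.sqrt(d)
f = rng.integers(0, M, size=N)

W = build_memory(E, U, f)
pred = recall(W, E, U)
accuracy = float((pred == f).mean())
print(f"fraction of associations recalled: {accuracy:.2f}")
```

In this sketch, random embeddings in a large dimension `d` are nearly orthogonal, so cross-talk between stored pairs stays small. Shrinking `d` relative to `N` makes interference more likely, which is the kind of trade-off between parameter size and error that the scaling laws quantify.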
Peer Reviews
Decision: ICLR 2024 spotlight
## Admirably formalizes the scaling laws in Transformers as a memorization/memory retrieval task in Associative Memories
- (+ +) The paper clearly and thoroughly defines a "sandbox" problem setting where we can study scaling laws (of discrete data domains, like the vocabulary tokens in NLP) using principles of Associative Memory.
- (+) The paper includes experiments using the Associative Memory sandbox to draw conclusions about good optimizers, learning rates, and batch sizes in larger models.
## Experiments not able to scale to large models
1. (-) It took several readings to understand the experimental setup. The clarity of the paper would be improved with a small architectural diagram describing the setting.
2. (-) To my understanding, the proposed method can only study Transformer blocks individually, not the entire Transformer as a whole (this is my understanding of Sec 4 paragraph 1: "our model is a proxy for the inner layers of a transformer").
3. (-) Like 2., the proposed metho
- The presentation is excellently organized: the notations, definitions, and associated propositions and theorems are carefully stated and accompanied by clean supporting simulation plots, and the cases explored make up a comprehensive and complete narrative for this interesting theoretical work.
- The current setup is synthetic/artificial: it is a drastic simplification of configurations found in practice, e.g. for real transformers. Although the text clearly notes where this simplified model may deviate from a real one, it remains to be seen how well the analogies hold. To this end, perhaps crisper (albeit riskier) predictions of how some of these results would translate/map to tangible observations in a real transformer would help the reader better appreciate the im
- The derived results look interesting in the context of the chosen model and construction of its parameters.
- Looking at discrete data with a real-world-like distribution is a promising idea.
- I think that, re-framed in the correct context, the results could add nice insights to scaling laws, even though in their current presentation they are more confusing than insightful.
- While the introduction and title suggest that the paper considers the memory capacity of associative memories, it seems that in fact it is investigating the error scaling laws of a specific learning problem, where a discrete input x determines an output y. The suspicion that this is learning is corroborated by the fact that giving more data for a fixed dimension (e.g. Fig. 3 right) improves the error. If the model were truly memorizing, eventually there would be a cut-off and no new data could
Taxonomy
Topics: Neural Networks and Applications · Machine Learning and Algorithms · Error Correcting Code Techniques
