Too Big to Think: Capacity, Memorization, and Generalization in Pre-Trained Transformers
Joshua Barron, Devin White

TL;DR
This study examines how the capacity of pre-trained Transformer models affects whether they memorize facts or generalize to unseen cases, and finds a consistent trade-off between the two that bears on model design.
Contribution
It provides a controlled analysis of how model size affects memorization and generalization, highlighting an inherent trade-off in pre-training large language models.
Findings
Small models generalize but do not memorize facts.
Large models memorize but fail to extrapolate.
When trained on both tasks jointly, no model, regardless of size, succeeds at extrapolation.
Abstract
The relationship between memorization and generalization in large language models (LLMs) remains an open area of research, with growing evidence that the two are deeply intertwined. In this work, we investigate this relationship by pre-training a series of capacity-limited Transformer models from scratch on two synthetic character-level tasks designed to separately probe generalization (via arithmetic extrapolation) and memorization (via factual recall). We observe a consistent trade-off: small models extrapolate to unseen arithmetic cases but fail to memorize facts, while larger models memorize but fail to extrapolate. An intermediate-capacity model exhibits a similar shift toward memorization. When trained on both tasks jointly, no model (regardless of size) succeeds at extrapolation. These findings suggest that pre-training may intrinsically favor one learning mode over the other. By…
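To make the two probes concrete, here is a minimal sketch of what character-level datasets of this kind could look like. It is not the authors' data pipeline: the string formats ("a+b=c" and "key:value"), the operand range, the held-out-pairs split, and the function names make_arithmetic_split, make_fact_set, and exact_match_accuracy are all assumptions made for illustration.

```python
import random


def make_arithmetic_split(max_operand=20, holdout_fraction=0.2, seed=0):
    """Build addition strings and hold out some operand pairs (assumed split)."""
    rng = random.Random(seed)
    pairs = [(a, b) for a in range(max_operand) for b in range(max_operand)]
    rng.shuffle(pairs)
    n_holdout = int(len(pairs) * holdout_fraction)
    held_out, seen = pairs[:n_holdout], pairs[n_holdout:]

    def fmt(a, b):
        # Hypothetical character-level format: "3+4=7"
        return f"{a}+{b}={a + b}"

    return [fmt(a, b) for a, b in seen], [fmt(a, b) for a, b in held_out]


def make_fact_set(n_facts=50, key_len=4, value_len=4, seed=0):
    """Build arbitrary key:value strings that can only be memorized."""
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    facts = {}
    while len(facts) < n_facts:
        key = "".join(rng.choice(alphabet) for _ in range(key_len))
        facts[key] = "".join(rng.choice(alphabet) for _ in range(value_len))
    return [f"{k}:{v}" for k, v in facts.items()]


def exact_match_accuracy(predict, examples, sep):
    """Fraction of examples whose completion after `sep` is reproduced exactly."""
    correct = 0
    for example in examples:
        prompt, target = example.split(sep, 1)
        if predict(prompt + sep) == target:
            correct += 1
    return correct / len(examples)
```

Under these assumptions, a predictor that is a pure lookup table over the seen arithmetic strings scores 1.0 on the seen split and 0.0 on the held-out split, which is the memorize-without-extrapolating behavior the abstract attributes to larger models. Scoring the fact set the same way (with sep=":") measures recall only, since the values are random and cannot be inferred.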
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Algorithms
