Differential learning kinetics govern the transition from memorization to generalization during in-context learning
Alex Nguyen, Gautam Reddy

TL;DR
This paper investigates how the different rates at which transformer sub-circuits learn govern the shift from memorization to generalization during in-context learning. It reports a memorization scaling law and mechanistic insights into the transition.
Contribution
It introduces a theory, supported by experiments on a synthetic task, that explains the transition from memorization to generalization through learning kinetics rather than model capacity.
Findings
Memorization and generalization sub-circuits learn at different rates.
A memorization scaling law predicts the task diversity threshold for generalization.
The theory explains phenomena like ICL acquisition timing and solution bimodality.
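The findings above describe a race between two sub-circuits. A minimal sketch of that idea follows, assuming (purely for illustration) that the time to memorize grows as a power law in the number of training tasks while the time to generalize does not depend on it. The power-law form, the function names, and the parameters `prefactor` and `exponent` are hypothetical and are not taken from the paper.

```python
# Toy model of a race between a memorizing and a generalizing sub-circuit.
# ASSUMPTION: memorization time = prefactor * num_tasks ** exponent.
# This functional form and its parameters are illustrative, not the paper's.


def memorization_time(num_tasks, prefactor=1.0, exponent=1.0):
    """Hypothetical time for the memorizing sub-circuit to fit num_tasks tasks."""
    return prefactor * num_tasks ** exponent


def diversity_threshold(generalization_time, prefactor=1.0, exponent=1.0):
    """Task count at which memorization takes as long as generalization."""
    return (generalization_time / prefactor) ** (1.0 / exponent)


def winning_solution(num_tasks, generalization_time, prefactor=1.0, exponent=1.0):
    """Return whichever sub-circuit finishes learning first in this toy model."""
    t_mem = memorization_time(num_tasks, prefactor, exponent)
    return "generalization" if generalization_time < t_mem else "memorization"


if __name__ == "__main__":
    # Sweep task diversity around the threshold for one hypothetical setting.
    t_gen, a, nu = 100.0, 1.0, 0.5
    print("threshold:", diversity_threshold(t_gen, a, nu))
    for k in (100, 1_000, 100_000, 1_000_000):
        print(k, winning_solution(k, t_gen, a, nu))
```

In this sketch, the outcome flips once task diversity exceeds the threshold, even though nothing in the model limits how many tasks can be memorized. This mirrors the summary's claim that kinetics, not capacity, set the transition.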
Abstract
Transformers exhibit in-context learning (ICL): the ability to use novel information presented in the context without additional weight updates. Recent work shows that ICL emerges when models are trained on a sufficiently diverse set of tasks, and that the transition from memorization to generalization is sharp with increasing task diversity. One interpretation is that a network's limited capacity to memorize favors generalization. Here, we examine the mechanistic underpinnings of this transition using a small transformer applied to a synthetic ICL task. Using theory and experiment, we show that the sub-circuits that memorize and generalize can be viewed as largely independent. The relative rates at which these sub-circuits learn explain the transition from memorization to generalization, rather than capacity constraints. We uncover a memorization scaling law, which determines the task…
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Neural Networks and Applications · Misinformation and Its Impacts
Methods: Sparse Evolutionary Training
