Differential learning kinetics govern the transition from memorization to generalization during in-context learning

Alex Nguyen; Gautam Reddy

arXiv:2412.00104·cs.LG·December 13, 2024

Open Access

TL;DR

This paper investigates how the different rates at which the memorizing and generalizing sub-circuits of a transformer learn govern the shift from memorization to generalization during in-context learning. It reports a memorization scaling law and a mechanistic account of the transition.

Contribution

The paper introduces a theory that explains the transition from memorization to generalization through learning kinetics rather than capacity constraints, and supports it with experiments on a small transformer trained on a synthetic ICL task.

Findings

01

Memorization and generalization sub-circuits learn at different rates.

02

A memorization scaling law predicts the task diversity threshold for generalization.

03

The theory explains phenomena like ICL acquisition timing and solution bimodality.

Abstract

Transformers exhibit in-context learning (ICL): the ability to use novel information presented in the context without additional weight updates. Recent work shows that ICL emerges when models are trained on a sufficiently diverse set of tasks and that the transition from memorization to generalization is sharp with increasing task diversity. One interpretation is that a network's limited capacity to memorize favors generalization. Here, we examine the mechanistic underpinnings of this transition using a small transformer applied to a synthetic ICL task. Using theory and experiment, we show that the sub-circuits that memorize and generalize can be viewed as largely independent. The relative rates at which these sub-circuits learn explain the transition from memorization to generalization, rather than capacity constraints. We uncover a memorization scaling law, which determines the task…

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Neural Networks and Applications · Misinformation and Its Impacts

Methods: Sparse Evolutionary Training