The two clocks and the innovation window: When and how generative models learn rules
Binxu Wang, Emma Lucia Byrnes Finn, Bingbin Liu

TL;DR
This paper investigates the timing of rule learning and memorization in generative models, revealing a two-clock structure that explains when models genuinely innovate versus memorize training data.
Contribution
It introduces the concepts of rule-learning time and memorization time, analyzing their dependence on model capacity, rule complexity, and dataset size across different architectures.
Findings
Rule learning time increases with rule complexity and decreases with model capacity.
Memorization time scales nearly linearly with dataset size and is invariant to rule complexity.
The 'innovation window' between rule learning and memorization can vanish, in which case there is no stage of training where the model produces rule-valid samples without also reproducing training data.
Abstract
Generative models trained on finite data face a fundamental tension: their score-matching or next-token objective converges to the empirical training distribution rather than the population distribution we seek to learn. Using rule-valid synthetic tasks, we trace this tension across two training timescales: the rule-learning time, the step at which generations first become rule-valid, and the memorization time, the step at which models begin reproducing training samples. Focusing on parity and extending to other binary rules and combinatorial puzzles, we characterize how these two clocks depend on key aspects of the learning setup. Specifically, we show that the rule-learning time increases with rule complexity and decreases with model capacity, while the memorization time is approximately invariant to the rule and scales nearly…
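To make the two clocks concrete, here is a minimal Python sketch of how they could be read off from samples generated at successive checkpoints. It is an illustration under stated assumptions, not the authors' evaluation code: the even-parity convention, the 0.9 and 0.5 thresholds, and all function names (`parity_valid`, `valid_fraction`, `copied_fraction`, `first_crossing`, `innovation_window`) are hypothetical choices made for this example.

```python
from typing import Callable, Optional, Sequence, Set, Tuple

Bits = Tuple[int, ...]


def parity_valid(sample: Bits) -> bool:
    """Assumed rule: a bit string is valid if it contains an even number of ones."""
    return sum(sample) % 2 == 0


def valid_fraction(samples: Sequence[Bits],
                   rule: Callable[[Bits], bool] = parity_valid) -> float:
    """Fraction of generated samples that satisfy the rule."""
    return sum(rule(s) for s in samples) / len(samples)


def copied_fraction(samples: Sequence[Bits], train_set: Set[Bits]) -> float:
    """Fraction of generated samples that exactly match a training sample."""
    return sum(s in train_set for s in samples) / len(samples)


def first_crossing(steps: Sequence[int], values: Sequence[float],
                   threshold: float) -> Optional[int]:
    """First training step at which the metric reaches the threshold, else None."""
    for step, value in zip(steps, values):
        if value >= threshold:
            return step
    return None


def innovation_window(t_rule: Optional[int], t_mem: Optional[int]) -> Optional[int]:
    """Steps between rule learning and memorization; 0 if the window has vanished."""
    if t_rule is None or t_mem is None:
        return None  # one of the clocks never fired in the observed run
    return max(0, t_mem - t_rule)


# Toy data, invented for illustration only.
train = {(0, 0, 0, 0), (1, 1, 0, 0)}
generated = [(0, 0, 0, 0), (1, 0, 0, 0), (1, 1, 1, 1), (1, 1, 0, 0)]
validity = valid_fraction(generated)       # 0.75: three of four have even parity
copying = copied_fraction(generated, train)  # 0.5: two of four are training samples

# Hypothetical per-checkpoint curves; thresholds 0.9 and 0.5 are arbitrary assumptions.
steps = [100, 200, 400, 800, 1600]
t_rule = first_crossing(steps, [0.50, 0.92, 0.99, 1.00, 1.00], 0.9)
t_mem = first_crossing(steps, [0.00, 0.01, 0.05, 0.30, 0.80], 0.5)
window = innovation_window(t_rule, t_mem)
```

In this toy run the rule-learning clock fires at step 200 and the memorization clock at step 1600, leaving a window of 1400 steps. Counting only exact matches as copies is the simplest possible memorization criterion; a near-duplicate criterion would need a distance measure that this page does not specify.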
