Slower Generalization, Faster Memorization: A Sweet Spot in Algorithmic Learning
Shin So, Kyelim Lee, and Albert No

TL;DR
This paper investigates how dataset size affects learning efficiency in a structured-output task and finds a sweet spot: high validation accuracy is reached fastest at an intermediate dataset size, and adding data beyond that point slows generalization, contrary to common intuition.
Contribution
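It demonstrates that, in Needleman-Wunsch matrix generation, the dataset size at which high validation accuracy is reached in the fewest gradient updates is intermediate rather than the largest. Conversely, in the regime where partial validation competence first appears, larger datasets can reach high training accuracy in fewer updates.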
Findings
Small Transformers reach high validation exact-match accuracy fastest at an intermediate dataset size, not at the largest one.
In the regime where partial validation competence first appears, larger datasets can require fewer updates to reach high training accuracy.
A multiplication baseline does not show the same slowdown past the dataset-size sweet spot.
Abstract
Critical-data-size accounts of grokking suggest a natural post-threshold intuition: once training data is sufficient to identify the underlying rule, additional data should accelerate validation convergence. We show that this intuition can fail in a controlled structured-output task. In Needleman-Wunsch (NW) matrix generation, small Transformers reach high validation exact-match accuracy fastest at an intermediate dataset size, not at the largest one. Past this dataset-size sweet spot, generalization remains achievable but requires more gradient updates. Conversely, in the regime where partial validation competence first appears, larger datasets can require fewer updates to reach high training accuracy, suggesting that emerging rule structure can accelerate fitting beyond example-wise memorization. A multiplication baseline does not show the same post-threshold slowdown. These results…
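For readers unfamiliar with the task, the following is a minimal sketch of what an NW score matrix is and what exact-match scoring could look like. The scoring values (match = 1, mismatch = -1, gap = -1), the function names `nw_matrix` and `exact_match`, and the list-of-lists output are assumptions made for illustration; the abstract does not state the paper's alphabet, scoring scheme, or output encoding.

```python
def nw_matrix(a, b, match=1, mismatch=-1, gap=-1):
    """Fill the Needleman-Wunsch global-alignment score matrix for a and b.

    The scoring parameters are illustrative defaults, not the paper's settings.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First column and first row: aligning a prefix against the empty string
    # costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each interior cell is the best of: diagonal (match or mismatch),
    # up (gap in b), and left (gap in a).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    return score


def exact_match(predicted, target):
    """Return True only if every cell of the predicted matrix is correct."""
    return predicted == target


if __name__ == "__main__":
    for row in nw_matrix("AB", "AB"):
        print(row)
```

Under exact-match scoring as sketched here, a single wrong cell makes the whole output count as incorrect, which is why the metric is a strict test of whether the underlying recurrence has been learned.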
