Critical Windows of Complexity Control: When Transformers Decide to Reason or Memorize
Sarwan Ali

TL;DR
This paper identifies a specific training window during which regularization determines whether Transformers learn to reason or memorize, revealing the importance of timing and initialization in training outcomes.
Contribution
It demonstrates that the memorization versus reasoning behavior in Transformers is governed by a sharp, task-specific training window influenced by initialization and regularization timing.
Findings
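Weight decay applied for a single window covering 25% of training matches full-training weight decay in out-of-distribution (OOD) accuracy.
With the total regularization budget held constant, placing the regularization in the middle of training yields significantly higher OOD accuracy.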
The critical window's onset is highly sensitive, shifting with as little as 100 steps.
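To make the windowed-regularization findings above concrete, here is a minimal sketch of a schedule that switches weight decay on only inside one contiguous window. It is an illustration under stated assumptions, not the paper's implementation: the function names, the 1,000-step run length, the coefficient 0.1, and the three placements are all hypothetical values chosen for the example.

```python
def windowed_weight_decay(step, total_steps, start_frac, width_frac=0.25, wd=0.1):
    """Return `wd` while `step` lies inside the window, otherwise 0.0.

    The window starts at `start_frac` of training and spans `width_frac`
    of it. All defaults are illustrative assumptions, not values from the paper.
    """
    start = int(start_frac * total_steps)
    end = start + int(width_frac * total_steps)
    return wd if start <= step < end else 0.0


def active_steps(total_steps, start_frac, width_frac=0.25, wd=0.1):
    """Count the steps on which weight decay is non-zero (a proxy for budget)."""
    return sum(
        1
        for step in range(total_steps)
        if windowed_weight_decay(step, total_steps, start_frac, width_frac, wd) > 0.0
    )


TOTAL_STEPS = 1000  # hypothetical run length

# Three hypothetical placements of the same 25% window.
placements = {"early": 0.0, "middle": 0.375, "late": 0.75}

# Every placement regularizes for the same number of steps, so any
# difference in OOD accuracy would come from timing, not from budget.
budgets = {
    name: active_steps(TOTAL_STEPS, start_frac)
    for name, start_frac in placements.items()
}

if __name__ == "__main__":
    for name, n_steps in budgets.items():
        print(f"{name}: weight decay active on {n_steps} steps")
```

In a PyTorch training loop, one way to apply such a schedule would be to write the returned value into each `optimizer.param_groups[i]["weight_decay"]` before calling `optimizer.step()`; whether the paper does it this way is not stated in the text above.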
Abstract
Recent work has shown that Transformers' compositional generalization is governed by complexity control (initialization scale and weight decay), which steers training toward low-complexity reasoning solutions rather than high-complexity memorization. Existing analyses, however, treat complexity control as a single static hyperparameter choice, leaving open when during training this control is actually decisive. We show that the memorization-versus-reasoning fate of a Transformer is determined within a sharp, identifiable window of training. On a controlled compositional task we find that (i) weight decay applied for a single 25%-of-training window matches full-training weight decay in out-of-distribution (OOD) accuracy ( vs ); (ii) holding total regularization budget constant, placing it in the middle of training yields higher OOD accuracy than…
