Critical Windows of Complexity Control: When Transformers Decide to Reason or Memorize

Sarwan Ali

arXiv:2605.04396 · cs.LG · May 7, 2026

TL;DR

This paper identifies a specific training window during which regularization determines whether Transformers learn to reason or memorize, revealing the importance of timing and initialization in training outcomes.

Contribution

The paper demonstrates that whether a Transformer memorizes or reasons is governed by a sharp, task-specific training window, which is influenced by initialization and by the timing of regularization.

Findings

01

Weight decay applied during a single window covering 25% of training matches full-training weight decay in OOD accuracy (0.93 vs 0.91).

02

With the total regularization budget held constant, placing it in the middle of training yields 5-9× higher OOD accuracy.

03

The critical window's onset is highly sensitive, shifting with as little as 100 steps.
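
The first two findings concern where a fixed amount of weight decay sits within the training run. As a minimal sketch of that idea (not the paper's implementation), the snippet below builds per-step weight-decay schedules that switch decay on only inside a window. The function names, the 1000-step run length, the 0.1 coefficient, and the early/middle/late start positions are all illustrative assumptions; only the 25% window width comes from the findings above.

```python
def weight_decay_at(step, total_steps, start_frac, width_frac, wd_on, wd_off=0.0):
    """Return the weight-decay coefficient to use at a given training step.

    Decay is wd_on inside the half-open window [start, end) and wd_off elsewhere.
    All parameter names here are hypothetical, chosen for illustration.
    """
    start = int(start_frac * total_steps)
    end = start + int(width_frac * total_steps)
    return wd_on if start <= step < end else wd_off


def build_schedule(total_steps, start_frac, width_frac, wd_on):
    """Materialize the per-step coefficients for a whole run."""
    return [
        weight_decay_at(step, total_steps, start_frac, width_frac, wd_on)
        for step in range(total_steps)
    ]


# Assumed values for illustration only.
TOTAL_STEPS = 1000
WD = 0.1
WIDTH = 0.25  # a window covering 25% of training

# Same budget (same number of regularized steps), different placement.
early = build_schedule(TOTAL_STEPS, 0.0, WIDTH, WD)
middle = build_schedule(TOTAL_STEPS, 0.375, WIDTH, WD)
late = build_schedule(TOTAL_STEPS, 0.75, WIDTH, WD)


def active_steps(schedule):
    """Count the steps on which weight decay is switched on."""
    return sum(1 for w in schedule if w > 0)
```

In a PyTorch training loop, one way to apply such a schedule would be to write the returned value into each entry of `optimizer.param_groups` under the `"weight_decay"` key before calling `optimizer.step()`. Because every schedule regularizes the same number of steps, any difference in outcome between them would be attributable to placement rather than to the amount of regularization.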

Abstract

Recent work has shown that Transformers' compositional generalization is governed by complexity control (initialization scale and weight decay), which steers training toward low-complexity reasoning solutions rather than high-complexity memorization. Existing analyses, however, treat complexity control as a single static hyperparameter choice, leaving open when during training this control is actually decisive. We show that the memorization-versus-reasoning fate of a Transformer is determined within a sharp, identifiable window of training. On a controlled compositional task we find that (i) weight decay applied for a single 25%-of-training window matches full-training weight decay in out-of-distribution (OOD) accuracy (0.93 vs 0.91); (ii) holding total regularization budget constant, placing it in the middle of training yields 5-9× higher OOD accuracy than…
