Preservation Is Not Enough for Width Growth: Regime-Sensitive Selection of Dense LM Warm Starts
Eren Unlu

TL;DR
This paper examines how warm-start strategies affect width expansion in dense language models. It finds that zero-step preservation alone does not identify the best widened warm start: the best choice depends on the training regime (deterministic or stochastic) and on the continuation lag.
Contribution
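It frames dense width growth as a candidate-selection problem over full training states (copied weights, optimizer moments, and scheduler state) and proposes regime-sensitive selection of warm starts, challenging the assumption that preservation is always optimal.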
Findings
Exact-copy symmetric warm starts rank first in every completed 16-step probe and in the completed stochastic 128-step continuations.
Structured non-clone warm starts excel in deterministic 128-step continuation.
Preservation is not a universal criterion; effectiveness varies with regime and lag.
Abstract
Width expansion offers a practical route to reuse smaller causal-language-model checkpoints, but selecting a widened warm start is not solved by zero-step preservation alone. We study dense width growth as a candidate-selection problem over full training states, including copied weights, optimizer moments, and scheduler state. In a small-scale TinyStories proxy, we compare exact-copy, perturbative, asymmetric-reset, and structured non-clone warm starts under matched continuation budgets. We evaluate zero-step preservation, short-lag probe metrics, and downstream continuation utility in deterministic and stochastic regimes. The picture is mixed and partially replicated through a reduced-pool seed-1 check. Exact-copy symmetric warm starts rank first in every completed 16-step probe and in the completed stochastic 128-step continuations at seed-0 steps 1000 and 2000 plus reduced seed-1…
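To make "zero-step preservation" concrete, here is a minimal sketch of one common way to build an exact-copy widened network: duplicate every hidden unit and halve its outgoing weights, so the wide model computes the same function before any training step. This is a Net2WiderNet-style illustration on a toy two-layer MLP, not the paper's procedure. The function names, the tanh activation, and the weight values are all assumptions chosen for the example, and the sketch covers weights only, not optimizer moments or scheduler state.

```python
import math

def matvec(W, x):
    # W is a list of rows; returns W @ x as a plain list.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mlp(W1, W2, x):
    # Two-layer MLP without biases: y = W2 @ tanh(W1 @ x).
    h = [math.tanh(v) for v in matvec(W1, x)]
    return matvec(W2, h)

def widen_exact_copy(W1, W2):
    # Hypothetical helper: double the hidden width by cloning each hidden
    # unit (rows of W1) and halving the outgoing weights (columns of W2),
    # so each clone pair contributes the same total as the original unit.
    W1_wide = [row[:] for row in W1] + [row[:] for row in W1]
    W2_wide = [[0.5 * w for w in row] + [0.5 * w for w in row] for row in W2]
    return W1_wide, W2_wide

# Toy weights (illustrative values only): 2 inputs, 3 hidden units, 1 output.
W1 = [[0.5, -0.25], [0.125, 0.75], [-1.0, 0.5]]
W2 = [[0.25, -0.5, 1.0]]
x = [1.0, -2.0]

W1_wide, W2_wide = widen_exact_copy(W1, W2)
y_small = mlp(W1, W2, x)
y_wide = mlp(W1_wide, W2_wide, x)

# Zero-step preservation gap: largest output difference before any update.
gap = max(abs(a - b) for a, b in zip(y_small, y_wide))
print("zero-step gap:", gap)
```

A gap near zero only says the widened model starts where the small one left off. Cloned units also receive identical gradients under deterministic training, which is one plausible reason a perfectly preserving start need not be the best one to continue from.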
