Optimizer-Induced Low-Dimensional Drift and Transverse Dynamics in Transformer Training
Yongzhong Xu

TL;DR
This paper uncovers a dominant low-dimensional drift in transformer training trajectories driven by optimizer dynamics, revealing how accumulated updates shape training geometry beyond instantaneous gradients.
Contribution
It introduces the concept of a stable low-dimensional backbone in transformer training trajectories caused by optimizer effects, highlighting the importance of cumulative dynamics over instantaneous gradients.
Findings
A dominant low-dimensional drift captures 60-80% of displacement.
Replacing AdamW with SGD removes the drift structure.
Reducing β₂ degrades backbone dominance and recoverability.
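One way to make the "60-80% of displacement" figure concrete is to measure how much of the squared displacement from initialization lies along a single direction. The sketch below is only an illustration under stated assumptions: it takes the backbone to be the top right singular vector of the (uncentered) matrix of checkpoint displacements, which may differ from the paper's exact definition, and it runs on synthetic data. The names `backbone_direction`, `checkpoints`, and `theta0` are hypothetical.

```python
import numpy as np

def backbone_direction(checkpoints, theta0):
    """Return (unit direction, fraction of squared displacement it captures).

    Assumption: the backbone is the leading right singular vector of the
    stacked displacements from initialization (no mean-centering).
    """
    displacements = np.asarray(checkpoints) - theta0  # one row per checkpoint
    _, s, vt = np.linalg.svd(displacements, full_matrices=False)
    fraction = s[0] ** 2 / np.sum(s ** 2)
    return vt[0], fraction

# Synthetic trajectory: steady drift along one axis plus isotropic noise.
rng = np.random.default_rng(0)
dim, steps = 50, 200
theta0 = rng.normal(size=dim)
drift = np.zeros(dim)
drift[0] = 1.0
t = np.arange(1, steps + 1)[:, None]
checkpoints = theta0 + 0.05 * t * drift + 0.1 * rng.normal(size=(steps, dim))

v, frac = backbone_direction(checkpoints, theta0)
print(f"fraction of squared displacement along top direction: {frac:.3f}")
```

For real models the flattened parameter vector is very large, so in practice one would likely apply this to a subset of parameters or a random projection; that choice is not specified in the summary above.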
Abstract
We analyze cumulative parameter trajectories of transformer training under AdamW and identify a dominant low-dimensional drift direction ("backbone") that captures 60-80% of long-horizon displacement from initialization. This direction is highly stable over rolling training windows yet reorients gradually across phases, particularly following objective reweighting. Per-batch gradients exhibit near-noise-floor alignment with the backbone, whereas optimizer-integrated updates align strongly with it, indicating that the structure emerges from accumulated optimizer dynamics rather than instantaneous gradient geometry. Replacing AdamW with SGD-family optimizers eliminates this structure, and reducing β₂ smoothly degrades backbone dominance and reheating recoverability. Reheating experiments show that transverse probe modes can be transiently re-excited without substantially…
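The phrase "near-noise-floor alignment" can be read against a simple baseline: the typical absolute cosine between a fixed direction and a random vector in d dimensions, which is roughly sqrt(2 / (π d)) for large d. The sketch below estimates that baseline empirically. It illustrates the measurement only, not the paper's procedure or the optimizer mechanism; `abs_cosine`, `noise_floor`, and the stand-in `backbone` vector are hypothetical names.

```python
import numpy as np

def abs_cosine(u, v):
    """Absolute cosine similarity between two vectors."""
    return abs(np.dot(u, v)) / (np.linalg.norm(u) * np.linalg.norm(v))

def noise_floor(direction, n_samples=2000, seed=0):
    """Mean |cos| between `direction` and random Gaussian vectors."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(size=(n_samples, direction.size))
    return float(np.mean([abs_cosine(s, direction) for s in samples]))

dim = 1000
backbone = np.zeros(dim)  # stand-in for an estimated backbone direction
backbone[0] = 1.0

floor = noise_floor(backbone)
predicted = np.sqrt(2.0 / (np.pi * dim))  # large-d approximation
print(f"empirical floor {floor:.4f}, predicted {predicted:.4f}")
```

With a backbone estimate in hand, one could compare `abs_cosine(gradient, backbone)` for a per-batch gradient and `abs_cosine(update, backbone)` for an optimizer-integrated update against `floor`; values close to `floor` would indicate no meaningful alignment.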
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Diffusion and Search Dynamics · Neural Networks and Reservoir Computing
