Large-Step Training Dynamics of a Two-Factor Linear Transformer Model
Krishnakumar Balasubramanian

TL;DR
This paper analyzes large-step gradient descent on a simplified linear transformer model and shows that high learning rates can produce catapult convergence, periodic or chaotic bounded nonconvergence, and divergence rather than monotone convergence.
Contribution
It provides an exact analysis of the training dynamics at large learning rates, uncovering phase transitions and invariant structures that influence convergence and stability.
Findings
Large learning rates can cause chaotic and divergent training behaviors.
For \(0<\mu<2\), the full two-dimensional dynamics have an explicit invariant Chebyshev ellipse that separates forward-invariant regions.
Finite-step training may settle into cycles or chaos, not just converge.
Abstract
Gradient-flow analyses show that simplified linear transformers can learn the in-context linear-regression algorithm, but they do not explain the finite-step behavior of gradient descent at large learning rates. Motivated by empirical work on high-learning-rate transformer instabilities and by the cubic-map phase diagram for quadratic regression, we study an exactly reducible one-prompt linear-transformer training problem. After normalization, the dynamics reduce to a two-factor product map with an effective step-size parameter \(\mu\). On the balanced slice, this map recovers the known scalar cubic transition from monotone convergence to catapult convergence, periodic and chaotic bounded nonconvergence, and divergence. We then analyze the full two-dimensional system and show that, for \(0<\mu<2\), it has an explicit invariant Chebyshev ellipse separating forward-invariant regions; this…
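To make the idea of a two-factor product map concrete, here is a minimal sketch in Python. It does not reproduce the paper's normalized map, which is not given in this summary. It assumes a hypothetical toy loss \(\tfrac{1}{2}(ab-1)^2\) in two scalar factors \(a\) and \(b\), with `mu` as the plain gradient-descent step size, so its thresholds need not match the paper's \(\mu\). On the balanced slice \(a=b\) the update collapses to the scalar cubic map \(a \mapsto a - \mu(a^2-1)a\).

```python
# Toy illustration only: assumed loss 0.5 * (a*b - 1)**2, NOT the paper's exact map.

def gd_step(a, b, mu):
    """One gradient-descent step on the assumed loss 0.5 * (a*b - 1)**2."""
    r = a * b - 1.0                      # residual of the product a*b against the target 1
    return a - mu * r * b, b - mu * r * a


def run(a, b, mu, steps=200, blowup=1e6):
    """Iterate gd_step; return the final (a, b), or None if an iterate exceeds blowup."""
    for _ in range(steps):
        a, b = gd_step(a, b, mu)
        if abs(a) > blowup or abs(b) > blowup:
            return None                  # treated as divergence in this sketch
    return a, b


if __name__ == "__main__":
    # Balanced initializations (a == b) stay balanced, so this is the scalar cubic map.
    for mu in (0.2, 0.7, 1.2, 3.0):
        print(mu, run(0.5, 0.5, mu))
```

In this toy model the fixed point \(a=b=1\) of the balanced map has multiplier \(1-2\mu\), so small steps converge monotonically, somewhat larger steps converge with oscillation, and steps beyond the stability threshold of the toy map no longer settle at the fixed point.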
