Low-Dimensional Execution Manifolds in Transformer Learning Dynamics: Evidence from Modular Arithmetic Tasks
Yongzhong Xu

TL;DR
The paper shows that transformer training trajectories, although they live in high-dimensional parameter spaces, rapidly collapse onto low-dimensional manifolds, and it discusses what this geometric structure implies for interpretability and training efficiency.
Contribution
It uncovers the low-dimensional geometric structure of transformer learning trajectories and links this to phenomena like attention concentration and parameter efficiency.
Findings
Training trajectories collapse onto 3-4 dimensional manifolds.
Attention saturation occurs along routing coordinates within the manifold.
SGD commutator alignment with the manifold decreases as training progresses.
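As a rough illustration of how a dimensional-collapse claim like the first finding could be checked, the sketch below runs PCA on a sequence of flattened parameter snapshots and counts the components needed to reach a variance threshold. It is not the paper's procedure: the function names, the 0.99 threshold, the ambient dimension of 64, and the synthetic random-walk trajectory are all assumptions made for this example.

```python
import numpy as np


def explained_variance_ratios(snapshots):
    """PCA spectrum of a trajectory.

    snapshots: (T, D) array, one flattened parameter vector per checkpoint.
    Returns the fraction of trajectory variance along each principal axis.
    """
    centered = snapshots - snapshots.mean(axis=0, keepdims=True)
    # Squared singular values of the centered data are proportional
    # to the PCA eigenvalues.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


def effective_dimension(snapshots, threshold=0.99):
    """Smallest number of principal components reaching `threshold` variance."""
    cumulative = np.cumsum(explained_variance_ratios(snapshots))
    return int(np.searchsorted(cumulative, threshold) + 1)


def synthetic_trajectory(num_steps=200, ambient_dim=64, latent_dim=3,
                         noise=1e-3, seed=0):
    """Hypothetical stand-in for real checkpoints: a random walk confined
    to a random latent_dim-dimensional subspace, plus small isotropic noise."""
    rng = np.random.default_rng(seed)
    basis, _ = np.linalg.qr(rng.standard_normal((ambient_dim, latent_dim)))
    latent = np.cumsum(rng.standard_normal((num_steps, latent_dim)), axis=0)
    off_manifold = noise * rng.standard_normal((num_steps, ambient_dim))
    return latent @ basis.T + off_manifold


trajectory = synthetic_trajectory()
ratios = explained_variance_ratios(trajectory)
dim = effective_dimension(trajectory)
print("variance in top 3 components:", ratios[:3].sum())
print("effective dimension at 99% variance:", dim)
```

With real checkpoints, `snapshots` would be built by concatenating all model parameters into one vector per saved step. The threshold is a judgment call, so the reported dimension should be read alongside the full spectrum rather than on its own.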
Abstract
We investigate the geometric structure of learning dynamics in overparameterized transformer models through carefully controlled modular arithmetic tasks. Our primary finding is that despite operating in high-dimensional parameter spaces (), transformer training trajectories rapidly collapse onto low-dimensional execution manifolds of dimension 3-4. This dimensional collapse is robust across random seeds and moderate task difficulties, though the orientation of the manifold in parameter space varies between runs. We demonstrate that this geometric structure underlies several empirically observed phenomena: (1) sharp attention concentration emerges as saturation along routing coordinates within the execution manifold, (2) SGD commutators are preferentially aligned with the execution subspace (up to random baseline) early in training, with of…
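The abstract mentions SGD commutators and their alignment with the execution subspace relative to a random baseline, but the excerpt here does not define either quantity. The sketch below shows one plausible reading, assumed for illustration only: the commutator is the difference between taking an SGD step on batch A then batch B and taking the same steps in the opposite order, and alignment is the fraction of the commutator's squared norm that lies inside a k-dimensional subspace. The quadratic toy losses, the names `sgd_commutator` and `subspace_alignment`, and the random subspace are all hypothetical.

```python
import numpy as np


def sgd_step(theta, hessian, lr):
    """One gradient step on the toy loss 0.5 * theta^T H theta."""
    return theta - lr * (hessian @ theta)


def sgd_commutator(theta, hess_a, hess_b, lr):
    """Parameters after (A then B) minus parameters after (B then A)."""
    theta_ab = sgd_step(sgd_step(theta, hess_a, lr), hess_b, lr)
    theta_ba = sgd_step(sgd_step(theta, hess_b, lr), hess_a, lr)
    return theta_ab - theta_ba


def subspace_alignment(vector, basis):
    """Fraction of the squared norm of `vector` inside span(basis).

    `basis` must have orthonormal columns, shape (D, k).
    """
    projected = basis.T @ vector
    return float((projected @ projected) / (vector @ vector))


def random_psd(rng, dim):
    """Random positive semidefinite matrix standing in for a batch Hessian."""
    m = rng.standard_normal((dim, dim))
    return m @ m.T / dim


rng = np.random.default_rng(0)
dim, k, lr = 32, 3, 0.01
hess_a, hess_b = random_psd(rng, dim), random_psd(rng, dim)
theta = rng.standard_normal(dim)

comm = sgd_commutator(theta, hess_a, hess_b, lr)

# Stand-in for an execution subspace; in practice this would come from
# the top principal directions of the training trajectory.
basis, _ = np.linalg.qr(rng.standard_normal((dim, k)))
alignment = subspace_alignment(comm, basis)

# A uniformly random direction has expected alignment k / D with any
# fixed k-dimensional subspace, which serves as the chance baseline.
baseline = k / dim
print("alignment / random baseline:", alignment / baseline)
```

For these quadratic losses the commutator reduces to lr^2 (H_B H_A - H_A H_B) theta, so it vanishes exactly when the two batch Hessians commute. Because the subspace here is random, the printed ratio is only a sanity check of the machinery, not a reproduction of any result.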
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Generative Adversarial Networks and Image Synthesis · Quantum many-body systems
