The Geometry of Multi-Task Grokking: Transverse Instability, Superposition, and Weight Decay Phase Structure
Yongzhong Xu

TL;DR
This paper extends geometric analysis of grokking from single-task settings to multi-task Transformers trained on modular arithmetic. It reports staggered generalization across tasks, confinement of training to an invariant low-dimensional manifold, and a dependence of phase structure and redundancy on weight decay.
Contribution
It introduces a geometric framework for multi-task grokking that covers staggered grokking order, an invariant execution manifold, and the role of weight decay in phase structure and redundancy.
Findings
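Tasks grok in a consistent staggered order across seeds: multiplication first, then squaring, then addition.
Optimization trajectories stay confined to an empirically invariant low-dimensional execution manifold, and commutator defects orthogonal to it reliably precede generalization.
Weight decay shapes the phase structure: grokking timescale, curvature depth, reconstruction threshold, and defect lead covary.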
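One way to make the staggered-order finding concrete is to read a grokking step off each task's test-accuracy curve and sort the tasks by it. The sketch below is illustrative only: the function names, the 0.99 threshold, and the first-crossing criterion are assumptions, not the paper's measurement procedure.

```python
def grok_step(steps, accuracies, threshold=0.99):
    """Return the first logged step whose test accuracy reaches the threshold.

    The threshold value and the first-crossing rule are assumptions made
    for this sketch. Returns None if the task never reaches the threshold.
    """
    for step, acc in zip(steps, accuracies):
        if acc >= threshold:
            return step
    return None


def grok_order(steps, curves, threshold=0.99):
    """Order tasks by grokking step, given per-task test-accuracy curves.

    `curves` maps a task name to a list of accuracies aligned with `steps`.
    Tasks that never reach the threshold are left out of the ordering but
    still appear in the returned timing dictionary with value None.
    """
    times = {task: grok_step(steps, acc, threshold) for task, acc in curves.items()}
    reached = {task: s for task, s in times.items() if s is not None}
    order = sorted(reached, key=reached.get)
    return order, times


def delays(order, times):
    """Step gaps between consecutive tasks in the grokking order."""
    return [times[b] - times[a] for a, b in zip(order, order[1:])]
```

Running `grok_order` on the curves from each seed and comparing the resulting orders and `delays` would be one simple check of whether the ordering is consistent across seeds.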
Abstract
Grokking -- the abrupt transition from memorization to generalization long after near-zero training loss -- has been studied mainly in single-task settings. We extend geometric analysis to multi-task modular arithmetic, training shared-trunk Transformers on dual-task (mod-add + mod-mul) and tri-task (mod-add + mod-mul + mod-sq) objectives across a systematic weight decay sweep. Five consistent phenomena emerge. (1) Staggered grokking order: multiplication generalizes first, followed by squaring, then addition, with consistent delays across seeds. (2) Universal integrability: optimization trajectories remain confined to an empirically invariant low-dimensional execution manifold; commutator defects orthogonal to this manifold reliably precede generalization. (3) Weight decay phase structure: grokking timescale, curvature depth, reconstruction threshold, and defect lead covary…
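The multi-task setup described in the abstract can be sketched as a dataset in which every input pair carries one label per task. This is a minimal illustration under stated assumptions: the modulus 97, the form of the squaring task, the train fraction, and all function names are hypothetical and are not given in the abstract.

```python
import random
from itertools import product

P = 97  # assumed prime modulus; the abstract does not state one

# Hypothetical task definitions. The add and mul forms follow the task
# names; the form of "sq" is an assumption, since the abstract does not
# define mod-sq.
TASKS = {
    "add": lambda a, b, p: (a + b) % p,
    "mul": lambda a, b, p: (a * b) % p,
    "sq": lambda a, b, p: (a * a + b * b) % p,
}


def build_examples(p=P, tasks=("add", "mul", "sq")):
    """Enumerate all (a, b) pairs with one label per task.

    Passing tasks=("add", "mul") gives a dual-task variant; the default
    gives a tri-task variant.
    """
    examples = []
    for a, b in product(range(p), repeat=2):
        labels = {name: TASKS[name](a, b, p) for name in tasks}
        examples.append(((a, b), labels))
    return examples


def train_test_split(examples, train_fraction=0.5, seed=0):
    """Shuffle deterministically and split; the fraction is an assumption."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Because every task shares the same inputs, a shared-trunk model can consume one `(a, b)` pair and route the trunk output to a separate head per task, which is what makes per-task grokking times directly comparable.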
