Unifying Learning Dynamics and Generalization in Transformers Scaling Law
Chiwun Yang

TL;DR
This paper develops a theoretical framework for understanding how transformer models' generalization improves with scale, linking learning dynamics, risk convergence, and the impact of computational resources in a unified manner.
Contribution
It introduces a formal ODE-based model of transformer learning dynamics, analyzes stochastic gradient descent in realistic settings, and derives phase transition-based bounds on generalization error.
Findings
Exponential decay of excess risk in the initial optimization phase.
Power-law decay of generalization error after a resource threshold.
Unified scaling laws for model size, training time, and dataset size.
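The two-phase behavior in the findings above can be pictured with a toy curve. The following is a minimal sketch, not the paper's bound: the function names, the single "resource budget" variable `c`, and every constant (`r0`, `rate`, `c0`, `alpha`) are hypothetical choices made only for illustration, since this summary does not state the actual constants or exponents. It shows excess risk decaying exponentially up to a threshold `c0` and following a power law beyond it, with the two pieces joined continuously.

```python
import math


def excess_risk(c, r0=1.0, rate=0.5, c0=4.0, alpha=0.5):
    """Toy two-phase excess-risk curve; all parameters are hypothetical.

    c     : resource budget (an assumed scalar stand-in for compute)
    r0    : excess risk at c = 0
    rate  : exponential decay rate in the initial phase
    c0    : assumed resource threshold where the regime changes
    alpha : assumed power-law exponent after the threshold
    """
    if c < 0:
        raise ValueError("resource budget must be non-negative")
    if c <= c0:
        # Initial phase: exponential decay.
        return r0 * math.exp(-rate * c)
    # Later phase: power-law decay, matched to the exponential at c0.
    risk_at_threshold = r0 * math.exp(-rate * c0)
    return risk_at_threshold * (c / c0) ** (-alpha)


def log_log_slope(c1, c2, **params):
    """Slope of log(risk) against log(c) between two budgets."""
    rise = math.log(excess_risk(c2, **params)) - math.log(excess_risk(c1, **params))
    run = math.log(c2) - math.log(c1)
    return rise / run


if __name__ == "__main__":
    for c in (0.0, 1.0, 2.0, 4.0, 8.0, 16.0, 64.0):
        print(f"c = {c:5.1f}   excess risk = {excess_risk(c):.6f}")
    # Beyond the threshold the log-log slope equals -alpha.
    print("log-log slope past threshold:", log_log_slope(8.0, 64.0))
```

On a log-log plot, the power-law phase of such a curve appears as a straight line with slope `-alpha`, which is the usual way scaling-law exponents are read off empirically.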
Abstract
The scaling law, a cornerstone of Large Language Model (LLM) development, predicts improvements in model performance with increasing computational resources. Yet, while empirically validated, its theoretical underpinnings remain poorly understood. This work formalizes the learning dynamics of transformer-based language models as an ordinary differential equation (ODE) system, then approximates this process to kernel behaviors. Departing from prior toy-model analyses, we rigorously analyze stochastic gradient descent (SGD) training for multi-layer transformers on sequence-to-sequence data with arbitrary data distribution, closely mirroring real-world conditions. Our analysis characterizes the convergence of generalization error to the irreducible risk as computational resources scale with data, especially during the optimization process. We establish a theoretical upper bound on excess…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Topic Modeling · Machine Learning in Materials Science
