Global Convergence in Training Large-Scale Transformers
Cheng Gao, Yuan Cao, Zihao Li, Yihan He, Mengdi Wang, Han Liu, Jason Matthew Klusowski, Jianqing Fan

TL;DR
This paper provides a rigorous theoretical analysis of the convergence properties of large-scale Transformers during training, establishing conditions under which gradient flow converges to a global minimum in the mean-field limit.
Contribution
It introduces novel mean-field techniques for analyzing Transformers, extending existing tools for deep networks so that they apply under partial homogeneity and local Lipschitz smoothness assumptions.
Findings
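Gradient flow reaches a global minimum when the weight decay regularization parameter is sufficiently small.
As model width and depth go to infinity, gradient flow on large Transformers converges to a Wasserstein gradient flow, which is represented by a PDE.
The new mean-field techniques developed for Transformers may be useful for future research.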
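For orientation, a Wasserstein gradient flow over parameter distributions is commonly written in the generic form below. This is a sketch under assumed notation, not the paper's exact statement: $\rho_t$ (the parameter distribution at time $t$), $R$ (the unregularized risk), $\lambda$ (the weight decay parameter), and $F$ (the regularized objective) are symbols introduced here for illustration, and the paper's precise functional and regularity conditions are not given in this summary.

```latex
% Generic Wasserstein gradient flow of a functional F over distributions \rho_t
% (assumed notation; the paper's exact functional may differ)
\partial_t \rho_t
  = \nabla_\theta \cdot \Big( \rho_t \, \nabla_\theta \frac{\delta F}{\delta \rho}(\rho_t) \Big),
\qquad
F(\rho) = R(\rho) + \frac{\lambda}{2} \int \|\theta\|_2^2 \, \mathrm{d}\rho(\theta).
```

In this form, the weight decay term adds a drift $\lambda\theta$ toward the origin in parameter space, which is why its size matters for the minimum that the flow can reach.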
Abstract
Despite the widespread success of Transformers across various domains, their optimization guarantees in large-scale model settings are not well-understood. This paper rigorously analyzes the convergence properties of gradient flow in training Transformers with weight decay regularization. First, we construct the mean-field limit of large-scale Transformers, showing that as the model width and depth go to infinity, gradient flow converges to the Wasserstein gradient flow, which is represented by a partial differential equation. Then, we demonstrate that the gradient flow reaches a global minimum consistent with the PDE solution when the weight decay regularization parameter is sufficiently small. Our analysis is based on a series of novel mean-field techniques that adapt to Transformers. Compared with existing tools for deep networks (Lu et al., 2020) that demand homogeneity and global…
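To make "gradient flow with weight decay" concrete, the following is a minimal sketch on a toy quadratic loss, not on a Transformer and not the paper's construction. It discretizes the flow dθ/dt = -∇L(θ) - λθ with forward Euler steps. The function name `gradient_flow_weight_decay`, the toy loss, the step size, and the target vector are all assumptions made for illustration.

```python
def gradient_flow_weight_decay(grad_loss, theta0, weight_decay, step=1e-2, n_steps=5000):
    """Forward-Euler discretization of d(theta)/dt = -grad_loss(theta) - weight_decay * theta."""
    theta = list(theta0)
    for _ in range(n_steps):
        g = grad_loss(theta)
        # One Euler step: loss gradient plus the weight decay pull toward zero.
        theta = [t - step * (gi + weight_decay * t) for t, gi in zip(theta, g)]
    return theta


# Hypothetical toy loss L(theta) = 0.5 * ||theta - target||^2, minimized at `target`.
target = [1.0, -2.0]


def grad_quadratic(theta):
    return [t - a for t, a in zip(theta, target)]


# With weight decay lam, the regularized minimizer of this toy loss is target / (1 + lam).
theta_small = gradient_flow_weight_decay(grad_quadratic, [0.0, 0.0], weight_decay=1e-3)
theta_large = gradient_flow_weight_decay(grad_quadratic, [0.0, 0.0], weight_decay=1.0)
```

On this toy problem the limit point is `target / (1 + weight_decay)`, so a small weight decay leaves the solution close to the unregularized minimizer while a large one pulls it noticeably toward zero. This only illustrates the role of the regularization parameter; it says nothing about the nonconvex Transformer setting the paper analyzes.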
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis
Methods: Weight Decay
