On the Convergence of Encoder-only Shallow Transformers
Yongtao Wu, Fanghui Liu, Grigorios G Chrysos, Volkan Cevher

TL;DR
This paper develops a theoretical framework for understanding the convergence of encoder-only shallow Transformers, analyzing the effects of scaling, initialization, and overparameterization on training dynamics.
Contribution
It provides the first global convergence theory for shallow Transformers in a realistic setting, highlighting the roles of scaling schemes and initialization.
Findings
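Quadratic overparameterization is sufficient for global convergence under the commonly used He and LeCun initializations.
The choice of scaling scheme and initialization matters: the theory separates their effects on convergence.
A neural tangent kernel (NTK) based analysis is also given, allowing a comparison of convergence behavior across these settings.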
Abstract
In this paper, we aim to build the global convergence theory of encoder-only shallow Transformers under a realistic setting from the perspective of architectures, initialization, and scaling in a finite-width regime. The difficulty lies in how to tackle the softmax in the self-attention mechanism, the core ingredient of the Transformer. In particular, we diagnose the scaling scheme, carefully tackle the input/output of softmax, and prove that quadratic overparameterization is sufficient for global convergence of our shallow Transformers under the He/LeCun initialization commonly used in practice. Besides, a neural tangent kernel (NTK) based analysis is also given, which facilitates a comprehensive comparison. Our theory demonstrates the separation on the importance of different scaling schemes and initialization. We believe our results can pave the way for a better understanding of modern…
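To make the ingredients named in the abstract concrete (softmax self-attention, a scaling factor on the attention scores, and He/LeCun initialization), here is a minimal sketch of a one-layer, single-head encoder forward pass. It is an illustration under stated assumptions, not the paper's exact model: the ReLU, average pooling, scalar output head, and all names (`shallow_encoder`, `make_params`, `tau0`, `gain`) are hypothetical choices made for this example. The two scalings compared, `1/sqrt(d_m)` and `1/d_m`, are assumed to be representative of the scaling schemes the abstract refers to.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def init_matrix(rows, cols, fan_in, gain, rng):
    """Gaussian init with variance gain / fan_in (gain=1: LeCun, gain=2: He)."""
    std = math.sqrt(gain / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def make_params(d, dm, gain, seed=0):
    """Hypothetical parameter set: query, key, value maps and an output vector."""
    rng = random.Random(seed)
    wq = init_matrix(d, dm, d, gain, rng)
    wk = init_matrix(d, dm, d, gain, rng)
    wv = init_matrix(d, dm, d, gain, rng)
    wo = [rng.gauss(0.0, math.sqrt(gain / dm)) for _ in range(dm)]
    return wq, wk, wv, wo


def shallow_encoder(x, params, tau0):
    """Single-head attention -> ReLU -> average pooling -> scalar output.

    x is an (n, d) sequence; tau0 scales the attention scores before softmax.
    Returns the scalar output and the (n, n) attention matrix.
    """
    wq, wk, wv, wo = params
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    k_t = [list(col) for col in zip(*k)]           # transpose of k, shape (dm, n)
    scores = matmul(q, k_t)                        # (n, n) raw attention scores
    attn = softmax_rows([[tau0 * s for s in row] for row in scores])
    hidden = [[max(0.0, u) for u in row] for row in matmul(attn, v)]
    pooled = [sum(col) / len(hidden) for col in zip(*hidden)]
    return sum(p * w for p, w in zip(pooled, wo)), attn


# Compare the two assumed scaling schemes on the same input and weights.
n, d, dm = 4, 8, 64
data_rng = random.Random(1)
x = [[data_rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
params = make_params(d, dm, gain=1.0)              # gain=1.0 -> LeCun-style init
results = {}
for name, tau0 in (("1/sqrt(d_m)", dm ** -0.5), ("1/d_m", 1.0 / dm)):
    y, attn = shallow_encoder(x, params, tau0)
    results[name] = (y, attn)
    print(name, "output:", y, "largest attention weight:", max(max(r) for r in attn))
```

Printing the largest attention weight under each `tau0` gives a quick feel for how the scaling changes the sharpness of the softmax input at initialization; the sketch makes no claim about which regime the paper's convergence guarantees cover.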
Taxonomy
Topics: Neural Networks and Applications · Neural Networks and Reservoir Computing · Model Reduction and Neural Networks
Methods: Attention Is All You Need · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Layer Normalization · Linear Layer · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing
