On the Convergence of Encoder-only Shallow Transformers

Yongtao Wu; Fanghui Liu; Grigorios G Chrysos; Volkan Cevher

arXiv:2311.01575 · cs.LG · November 6, 2023 · 1 citation

TL;DR

This paper develops a theoretical framework for understanding the convergence of encoder-only shallow Transformers, analyzing the effects of scaling, initialization, and overparameterization on training dynamics.

Contribution

It provides the first global convergence theory for shallow Transformers in a realistic setting, highlighting the roles of scaling schemes and initialization.

Findings

01. Quadratic overparameterization ensures convergence with common initializations.

02. Different scaling schemes significantly affect training dynamics.

03. NTK analysis offers a comprehensive comparison of convergence behaviors.

Abstract

In this paper, we aim to build a global convergence theory for encoder-only shallow Transformers under a realistic setting, from the perspective of architecture, initialization, and scaling in a finite-width regime. The difficulty lies in handling the softmax in the self-attention mechanism, the core ingredient of the Transformer. In particular, we diagnose the scaling scheme, carefully treat the input and output of the softmax, and prove that quadratic overparameterization is sufficient for global convergence of our shallow Transformers under the He/LeCun initialization commonly used in practice. In addition, a neural tangent kernel (NTK) based analysis is given, which facilitates a comprehensive comparison. Our theory demonstrates a separation in the importance of different scaling schemes and initializations. We believe our results can pave the way for a better understanding of modern…
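To make the ingredients named in the abstract concrete, the following is a minimal sketch of one possible shallow encoder forward pass. It is not taken from the paper: the layer layout (single-head self-attention, ReLU layer, mean pooling, linear readout), the two scaling options for the softmax input (1/sqrt(d_m) and 1/d_m), and all function and parameter names are assumptions chosen for illustration. The initializers use the standard definitions, with variance 1/fan_in for LeCun and 2/fan_in for He.

```python
import math
import random


def init_matrix(rows, cols, scheme, rng):
    """Gaussian init: variance 1/fan_in (LeCun) or 2/fan_in (He), fan_in = rows."""
    gain = {"lecun": 1.0, "he": 2.0}[scheme]
    std = math.sqrt(gain / rows)
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def build_params(d, d_m, scheme, seed):
    """Hypothetical parameter set: W_Q, W_K, W_V (d x d_m), W_H (d_m x d_m), w_O (d_m x 1)."""
    rng = random.Random(seed)
    return (
        init_matrix(d, d_m, scheme, rng),
        init_matrix(d, d_m, scheme, rng),
        init_matrix(d, d_m, scheme, rng),
        init_matrix(d_m, d_m, scheme, rng),
        init_matrix(d_m, 1, scheme, rng),
    )


def shallow_encoder(x, params, scaling):
    """x is a list of n token rows with d features each; returns a scalar output."""
    w_q, w_k, w_v, w_h, w_o = params
    d_m = len(w_q[0])
    # Assumed scaling choices for the softmax input.
    tau = {"sqrt": d_m ** -0.5, "linear": 1.0 / d_m}[scaling]
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    k_t = [list(col) for col in zip(*k)]          # transpose of K
    scores = matmul(q, k_t)                       # n x n attention logits
    attn = [softmax([tau * s for s in row]) for row in scores]
    a1 = matmul(attn, v)                          # self-attention output
    a2 = [[max(0.0, h) for h in row] for row in matmul(a1, w_h)]  # ReLU layer
    pooled = [sum(col) / len(a2) for col in zip(*a2)]             # mean over tokens
    return sum(p * w[0] for p, w in zip(pooled, w_o))             # linear readout


if __name__ == "__main__":
    rng = random.Random(0)
    x = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]
    params = build_params(d=4, d_m=8, scheme="lecun", seed=1)
    for scaling in ("sqrt", "linear"):
        print(scaling, shallow_encoder(x, params, scaling))
```

Running the script evaluates the same parameters under both scaling options, which is one simple way to see that the choice of scaling changes how sharply the softmax concentrates as the width d_m grows.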

Videos

On the Convergence of Encoder-only Shallow Transformers · SlidesLive

Taxonomy

Topics: Neural Networks and Applications · Neural Networks and Reservoir Computing · Model Reduction and Neural Networks

Methods: Attention Is All You Need · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Layer Normalization · Linear Layer · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing