On the Convergence of Gradient Descent on Learning Transformers with Residual Connections
Zhen Qin, Jinxin Zhou, Jiachen Jiang, Zhihui Zhu

TL;DR
This paper provides a theoretical analysis of the convergence behavior of gradient descent on single-layer and multi-layer Transformers with residual connections, highlighting their role in improving optimization stability.
Contribution
It offers the first convergence analysis of complete Transformer architectures with residuals, demonstrating their impact on training dynamics and stability.
Findings
Gradient descent converges linearly under proper initialization.
Residual connections improve conditioning of the attention output matrix.
Empirical results support the theoretical benefits of residuals in training stability.
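The conditioning finding above can be illustrated with a small numerical sketch. This is not the paper's construction or experiment. It is a toy single-head softmax attention in NumPy, with the sizes `n` and `d`, the weight names `W_q`, `W_k`, `W_v`, and the 0.01 initialization scale all chosen here as assumptions. With small query/key weights the attention matrix is close to uniform, so the attention output without a residual is close to rank one and badly conditioned. Adding the input back gives a matrix whose condition number is much smaller.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_output(X, W_q, W_k, W_v, residual):
    # Single-head softmax self-attention over the rows (tokens) of X.
    # Toy illustration only; not the architecture as specified in the paper.
    d = W_q.shape[1]
    scores = (X @ W_q) @ (X @ W_k).T / np.sqrt(d)
    A = softmax(scores, axis=-1)        # (n, n) row-stochastic attention matrix
    out = A @ X @ W_v                   # (n, d) attention output
    return X + out if residual else out

rng = np.random.default_rng(0)
n, d = 16, 4                            # assumed sizes: 16 tokens, width 4
X = rng.standard_normal((n, d))

# Assumed initialization: small query/key weights give near-uniform attention.
W_q = 0.01 * rng.standard_normal((d, d))
W_k = 0.01 * rng.standard_normal((d, d))
W_v = rng.standard_normal((d, d)) / np.sqrt(d)

# 2-norm condition number (largest over smallest singular value).
cond_plain = np.linalg.cond(attention_output(X, W_q, W_k, W_v, residual=False))
cond_res = np.linalg.cond(attention_output(X, W_q, W_k, W_v, residual=True))

print(f"condition number without residual: {cond_plain:.3e}")
print(f"condition number with residual:    {cond_res:.3e}")
```

The gap depends on the assumed initialization scale: larger query/key weights move the attention matrix away from uniform and narrow the difference, so the sketch shows the mechanism rather than a general guarantee.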
Abstract
Transformer models have emerged as fundamental tools across various scientific and engineering disciplines, owing to their outstanding performance in diverse applications. Despite this empirical success, the theoretical foundations of Transformers remain relatively underdeveloped, particularly in understanding their training dynamics. Existing research predominantly examines isolated components--such as self-attention mechanisms and feedforward networks--without thoroughly investigating the interdependencies between these components, especially when residual connections are present. In this paper, we aim to bridge this gap by analyzing the convergence behavior of a structurally complete yet single-layer Transformer, comprising self-attention, a feedforward network, and residual connections. We demonstrate that, under appropriate initialization, gradient descent exhibits a linear…
