Taming Transformer Without Using Learning Rate Warmup
Xianbiao Qi, Yelin He, Jiaquan Ye, Chun-Guang Li, Bojia Zi, Xili Dai, Qin Zou, Rong Xiao

TL;DR
This paper introduces a theoretical analysis of Transformer training failures related to spectral energy concentration and proposes a novel optimization strategy to enable training without learning rate warmup, demonstrating effectiveness across multiple models.
Contribution
The paper provides a new theoretical understanding of Transformer training crashes and introduces a spectral energy-based optimization method to train Transformers without warmup.
Findings
Effective training of ViT, Swin-Transformer, and GPT without warmup.
Prevents spectral energy concentration and entropy collapse.
Improves training stability and efficiency.
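To make the "entropy collapse" finding concrete, the following is a minimal sketch. It assumes entropy means the Shannon entropy of each row of the softmax attention map, and `mean_attention_entropy` is a hypothetical helper, not code from the paper. It shows that inflating the query/key logits pushes the attention rows toward one-hot distributions and drives the entropy down.

```python
import numpy as np

def mean_attention_entropy(q, k):
    """Mean Shannon entropy (in nats) of the row-wise softmax of q @ k.T / sqrt(d).

    Hypothetical helper for illustration; not taken from the paper.
    """
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = logits - logits.max(axis=-1, keepdims=True)  # stabilise the exponent
    p = np.exp(logits)
    p /= p.sum(axis=-1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
n, d = 8, 16  # toy sizes chosen for illustration
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))

entropy_small = mean_attention_entropy(q, k)
# Scaling both q and k by 10 scales every logit by 100.
entropy_large = mean_attention_entropy(10.0 * q, 10.0 * k)
# All-zero queries give uniform attention, whose entropy is log(n).
entropy_uniform = mean_attention_entropy(np.zeros((n, d)), k)
```

Uniform attention gives the maximum value log(n); a value near zero means each query attends to essentially one key.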
Abstract
Scaling Transformer to a large scale without using technical tricks such as learning rate warmup or an obviously lower learning rate is an extremely challenging task, and is gaining increasing attention. In this paper, we provide a theoretical analysis of the process of training Transformer and reveal the rationale behind the model crash phenomenon in the training process, termed spectral energy concentration of $W_q^{\top} W_k$, which is the reason for a malignant entropy collapse, where $W_q$ and $W_k$ are the projection matrices for the query and the key in Transformer, respectively. To remedy this problem, motivated by Weyl's Inequality, we present a novel optimization strategy, i.e., making the weight updating in successive steps smooth: if the ratio is larger than a…
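The two ingredients named in the abstract can be illustrated numerically. This is a sketch under stated assumptions, not the authors' method: `top_energy_share` is a hypothetical metric (the fraction of squared singular-value mass carried by the largest singular value of the query-transpose-key product), and `weyl_gap` only checks the standard Weyl bound on the largest singular value after a gradient-style update, sigma_1(W - lr*G) <= sigma_1(W) + lr*sigma_1(G).

```python
import numpy as np

def top_energy_share(w_q, w_k):
    """Share of squared singular-value mass held by sigma_1 of W_q^T W_k.

    Hypothetical concentration metric, introduced here for illustration.
    """
    s = np.linalg.svd(w_q.T @ w_k, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))

def weyl_gap(w, g, lr):
    """Return (sigma_1(W - lr*G), sigma_1(W) + lr*sigma_1(G)).

    Weyl's inequality for singular values guarantees first <= second.
    """
    s_new = np.linalg.norm(w - lr * g, ord=2)  # ord=2 is the largest singular value
    bound = np.linalg.norm(w, ord=2) + lr * np.linalg.norm(g, ord=2)
    return float(s_new), float(bound)

rng = np.random.default_rng(0)
d = 16  # toy dimension chosen for illustration
w_q = rng.standard_normal((d, d)) / np.sqrt(d)
w_k = rng.standard_normal((d, d)) / np.sqrt(d)
share_random = top_energy_share(w_q, w_k)

# A query projection dominated by one rank-one direction makes the
# product W_q^T W_k nearly rank one, so the energy concentrates in sigma_1.
u = rng.standard_normal((d, 1))
v = rng.standard_normal((d, 1))
w_q_conc = u @ v.T + 0.01 * w_q
share_conc = top_energy_share(w_q_conc, w_k)

# Check the Weyl bound for a random "gradient" g and step size lr.
g = rng.standard_normal((d, d))
s_new, bound = weyl_gap(w_k, g, lr=0.1)
```

Dividing the bound by sigma_1(W) shows that the relative growth of the top singular value in one step is at most lr * sigma_1(G) / sigma_1(W), which is why keeping that quantity small keeps successive weights spectrally close.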
Peer Reviews
Decision·ICLR 2025 Poster
This paper demonstrates strong originality, with the necessary mathematical definitions and proofs included as required. It is well-structured and organized clearly. The essential mathematical framework is presented both in the main text and supplementary material. Additionally, experiments have been conducted to verify the theoretical claims made in the paper.
This paper has several areas for improvement in terms of writing, particularly regarding the use of mathematical notations in the abstract without proper definitions. A similar issue is evident in the Introduction as well. The experiments are inadequate; the paper should incorporate additional baseline models for comparison, as it currently only discusses or modifies three models. It would be advantageous to present empirical results using well-known recent models, such as those with linear com
- The analysis of the product of the (transposed) query and key weight matrices is interesting. Specifically, using the singular values of this product matrix to show collapse is interesting.
- Application of Weyl's inequality to the update equation for the optimizer to stabilize Transformer training is a nice development.
- The biggest contribution of this paper is that the proposed algorithm, built on this analysis, frees a practitioner from having to use learning rate warmup.
- The paper contains limited empirical results. Specifically, the experimental setup used to study the algorithm is small and no results are provided with larger models (0.5B, 1B, 3B). This limits the contribution, as it's not clear how the proposed algorithm will work with larger models (albeit ones that can still be considered "small") where training instabilities are more apparent.
- Related to the above, the QK layer norm paper by Dehghani et al. (2023) shows that training instabilities occur at 8B for visi
1. The authors propose an alternate method for transformer training that does not need learning rate warmup.
1. The choice of parameter norms for tracking training dynamics seems arbitrary. It’s already established in the literature that parameter norms diverge when training fails, so this observation does not seem novel [1]. The paper could benefit from a deeper analysis or rationale for the specific parameters chosen, or alternatively, from an exploration of novel insights that could provide a more compelling argument.
2. While the authors claim that the attention maps are sparse and low-rank, the plots
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis
