Taming Transformer Without Using Learning Rate Warmup

Xianbiao Qi; Yelin He; Jiaquan Ye; Chun-Guang Li; Bojia Zi; Xili Dai; Qin Zou; Rong Xiao

arXiv:2505.21910·cs.LG·May 29, 2025

Open Access 3 Reviews

TL;DR

This paper introduces a theoretical analysis of Transformer training failures related to spectral energy concentration and proposes a novel optimization strategy to enable training without learning rate warmup, demonstrating effectiveness across multiple models.

Contribution

The paper provides a new theoretical understanding of Transformer training crashes and introduces a spectral energy-based optimization method to train Transformers without warmup.

Findings

01. Effective training of ViT, Swin-Transformer, and GPT without warmup.

02. Prevents spectral energy concentration and entropy collapse.

03. Improves training stability and efficiency.
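The two quantities named in the findings can be made concrete with a small NumPy sketch. The definitions here are assumptions chosen for illustration, not taken from the paper: "spectral energy share" is the fraction of squared singular values of $\mathbf{W}_q^{\top}\mathbf{W}_k$ carried by the largest one, and "attention entropy" is the mean row entropy of the softmax attention map. The function names, shapes, and the rank-one perturbation are hypothetical.

```python
import numpy as np

def top_energy_share(w_q, w_k):
    """Share of squared singular values of W_q^T W_k held by the largest one."""
    s = np.linalg.svd(w_q.T @ w_k, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))

def mean_attention_entropy(x, w_q, w_k):
    """Mean row entropy of softmax(X W_q^T W_k X^T / sqrt(d_head))."""
    d_head = w_q.shape[0]
    logits = (x @ w_q.T) @ (x @ w_k.T).T / np.sqrt(d_head)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    p = p / p.sum(axis=1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=1).mean())

rng = np.random.default_rng(0)
d_model, d_head, n_tokens = 32, 16, 16
x = rng.standard_normal((n_tokens, d_model))

# Assumed "healthy" weights: small random projections.
w_q = rng.standard_normal((d_head, d_model)) / np.sqrt(d_model)
w_k = rng.standard_normal((d_head, d_model)) / np.sqrt(d_model)

# Assumed "concentrated" weights: the same matrices plus a large shared
# rank-one component, so one singular direction dominates W_q^T W_k.
u = rng.standard_normal(d_head)
u /= np.linalg.norm(u)
v = rng.standard_normal(d_model)
v /= np.linalg.norm(v)
spike = 20.0 * np.outer(u, v)

share_random = top_energy_share(w_q, w_k)
share_spiked = top_energy_share(w_q + spike, w_k + spike)
entropy_random = mean_attention_entropy(x, w_q, w_k)
entropy_spiked = mean_attention_entropy(x, w_q + spike, w_k + spike)

print(f"energy share: random={share_random:.3f}, spiked={share_spiked:.3f}")
print(f"attention entropy: random={entropy_random:.3f}, spiked={entropy_spiked:.3f}")
```

In this toy setup the spiked weights put most of the spectral energy in one direction and the attention rows become much more peaked, which is the qualitative link between the two terms that the findings describe.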

Abstract

Scaling Transformers to a large scale without technical tricks such as learning rate warmup or a markedly lower learning rate is extremely challenging and is attracting increasing attention. In this paper, we provide a theoretical analysis of the Transformer training process and reveal the rationale behind the model crash phenomenon, termed spectral energy concentration of $\mathbf{W}_{q}^{\top} \mathbf{W}_{k}$, which is the reason for a malignant entropy collapse, where $\mathbf{W}_{q}$ and $\mathbf{W}_{k}$ are the projection matrices for the query and the key in Transformer, respectively. To remedy this problem, motivated by Weyl's inequality, we present a novel optimization strategy, i.e., making the weight updates in successive steps smooth: if the ratio $\frac{\sigma_{1}(\nabla \mathbf{W}_{t})}{\sigma_{1}(\mathbf{W}_{t-1})}$ is larger than a…
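As a rough illustration of the ratio named in the abstract, the NumPy sketch below computes $\sigma_{1}(\nabla \mathbf{W}_{t}) / \sigma_{1}(\mathbf{W}_{t-1})$ and checks the bound that Weyl's inequality for singular values implies for a single update, $|\sigma_{1}(\mathbf{W} + \Delta) - \sigma_{1}(\mathbf{W})| \le \sigma_{1}(\Delta)$. It is not the authors' algorithm: the plain gradient step, the learning rate `lr`, the random matrices, and the helper names are all assumptions. The threshold and the action taken when it is exceeded are left out because the abstract is truncated at that point.

```python
import numpy as np

def spectral_norm(m):
    """Largest singular value sigma_1 of a 2-D matrix."""
    return float(np.linalg.norm(m, 2))

def update_ratio(grad, w_prev):
    """sigma_1(grad) / sigma_1(w_prev), the ratio named in the abstract."""
    return spectral_norm(grad) / spectral_norm(w_prev)

rng = np.random.default_rng(0)
w_prev = rng.standard_normal((64, 64))  # stand-in for W_{t-1}
grad = rng.standard_normal((64, 64))    # stand-in for the step-t gradient or update
lr = 1e-2                               # assumed learning rate

# Assumed plain gradient step; the paper's optimizer may differ.
w_next = w_prev - lr * grad

ratio = update_ratio(grad, w_prev)
rel_change = abs(spectral_norm(w_next) - spectral_norm(w_prev)) / spectral_norm(w_prev)

# Weyl's inequality bounds the relative change of sigma_1 by lr * ratio.
weyl_bound = lr * ratio
weyl_holds = rel_change <= weyl_bound + 1e-12

print(f"ratio={ratio:.4f}, relative change={rel_change:.6f}, bound={weyl_bound:.6f}")
```

The bound shows why the ratio is a natural quantity to monitor: when `lr * ratio` is small, the top singular value of the weight matrix cannot move much in one step.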

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

This paper demonstrates strong originality, with the necessary mathematical definitions and proofs included as required. It is well-structured and organized clearly. The essential mathematical framework is presented both in the main text and supplementary material. Additionally, experiments have been conducted to verify the theoretical claims made in the paper.

Weaknesses

This paper has several areas for improvement in terms of writing, particularly regarding the use of mathematical notations in the abstract without proper definitions. A similar issue is evident in the Introduction as well. The experiments are inadequate; the paper should incorporate additional baseline models for comparison, as it currently only discusses or modifies three models. It would be advantageous to present empirical results using well-known recent models, such as those with linear com

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- The analysis of the product of the transposed query weight matrix and the key weight matrix is interesting. Specifically, using the singular values of this product matrix to show collapse is interesting.
- Applying Weyl's inequality to the optimizer's update equation to stabilize Transformer training is a nice development.
- The biggest contribution of this paper is that the proposed algorithm, built on this analysis, frees a practitioner from having to use learning rate warmup.

Weaknesses

- The paper contains limited empirical results. Specifically, the experimental setup used to study the algorithm is small, and no results are provided with larger models (0.5B, 1B, 3B). This limits the contribution, as it's not clear how the proposed algorithm will work with larger models (albeit ones that can still be considered "small") where training instabilities are more apparent.
- Related to the above, the QK layer norm paper by Dehghani et al. (2023) shows that training instabilities occur at 8B for visi

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The authors propose an alternate method for Transformer training that does not need learning rate warmup.

Weaknesses

1. The choice of parameter norms for tracking training dynamics seems arbitrary. It's already established in the literature that parameter norms diverge when training fails, so this observation does not seem novel [1]. The paper could benefit from a deeper analysis or rationale for the specific parameters chosen, or alternatively, from an exploration of novel insights that could provide a more compelling argument.
2. While the authors claim that the attention maps are sparse and low-rank, the plots

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis