DNT: a Deeply Normalized Transformer that can be trained by Momentum SGD
Xianbiao Qi, Marco Chen, Wenjie Xiao, Jiaquan Ye, Yelin He, Chun-Guang Li, Zhouchen Lin

TL;DR
This paper introduces the Deeply Normalized Transformer (DNT), which uses normalization to keep gradient distributions concentrated, so that it can be trained with vanilla momentum SGD (mSGDW) while reaching performance comparable to AdamW-based training.
Contribution
DNT's normalization approach allows Transformers to be trained effectively with momentum SGD, simplifying training while keeping performance comparable to AdamW-based training.
Findings
DNT outperforms ViT and GPT in experiments.
DNT can be trained effectively with vanilla mSGDW.
Normalization stabilizes gradient distributions in Transformers.
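To make the last finding more concrete, here is a minimal sketch in plain Python of one well-known property of normalization: an RMS-style normalization returns the same output regardless of how its input is scaled, so the layer that follows always sees activations of the same magnitude. The helper name `rms_norm` and the `eps` value are illustrative assumptions; this page does not specify which normalization layers DNT uses or where it places them.

```python
import math

def rms_norm(x, eps=1e-8):
    """Divide a vector by its root-mean-square (hypothetical helper, no learned gain)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / (rms + eps) for v in x]

x = [0.5, -1.0, 2.0, 0.25]
small = rms_norm(x)
large = rms_norm([1000.0 * v for v in x])  # same direction, 1000x larger scale

# Largest elementwise gap between the two normalized vectors
max_diff = max(abs(a - b) for a, b in zip(small, large))

# Mean of squares of the normalized vector, which should be close to 1
mean_square = sum(v * v for v in small) / len(small)
```

Because the output does not grow with the input scale, an unusually large activation cannot by itself produce an unusually large signal in the next layer, which is the intuition behind using normalization to keep gradients concentrated.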
Abstract
Transformers have become the de facto backbone of modern deep learning, yet their training typically demands an advanced optimizer with an adaptive learning rate, such as AdamW, rather than momentum SGDW (mSGDW). Previous works show that this is mainly due to a heavy-tailed distribution of the gradients. In this paper, we introduce a Deeply Normalized Transformer (DNT), which is meticulously engineered to overcome this limitation, enabling seamless training with vanilla mSGDW while yielding performance comparable to that of Transformers trained via AdamW. Specifically, in DNT, we strategically integrate normalization techniques at proper positions in the Transformer to effectively modulate the Jacobian matrices of each layer, balance the influence of weights, activations, and their interactions, and thus keep the distributions of gradients concentrated. We provide both theoretical…
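For readers unfamiliar with the optimizer named in the abstract, the following is a minimal sketch of one momentum SGD step with decoupled weight decay, which is the usual reading of "mSGDW". The exact update rule and hyperparameters used in the paper are not given on this page, so the function `msgdw_step`, its argument names, and its default values are assumptions for illustration only.

```python
def msgdw_step(params, grads, velocity, lr=0.1, momentum=0.9, weight_decay=0.01):
    """One step of momentum SGD with decoupled weight decay (illustrative form)."""
    new_params, new_velocity = [], []
    for w, g, v in zip(params, grads, velocity):
        v = momentum * v + g           # accumulate the gradient into the momentum buffer
        w = w - lr * weight_decay * w  # decay acts on the weight directly, not through the gradient
        w = w - lr * v                 # one global learning rate shared by every parameter
        new_params.append(w)
        new_velocity.append(v)
    return new_params, new_velocity

# Toy problem: minimise f(w) = 0.5 * w^2 per coordinate, whose gradient is w itself
params, velocity = [1.0, -2.0], [0.0, 0.0]
for _ in range(200):
    grads = list(params)
    params, velocity = msgdw_step(params, grads, velocity)
```

Unlike AdamW, this update applies no per-parameter rescaling of the step, which is why a heavy-tailed gradient distribution is a problem for it and why the abstract focuses on concentrating the gradients through the architecture instead.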
Peer Reviews
Decision: ICLR 2026 Poster
- The paper tackles a practical and important problem, reducing dependency on adaptive optimizers like Adam, by improving Transformer architectures to work with simpler optimizers.
- The theoretical analysis provides a clear connection between normalization placement and Jacobian conditioning, offering intuition for the design of DNT. They also did a comprehensive analysis of different normalization techniques.
- Experimental results across both vision and language models demonstrate that DNT n
- The experiments, while promising, are limited in scale (e.g., GPT2-Small/Large, ViT-Large) and lack validation on larger models or diverse datasets to confirm robustness.
- The empirical novelty is modest, as the approach primarily reorganizes existing normalization techniques rather than introducing new mechanisms.
- The paper does not include detailed ablation studies to isolate the contribution of each normalization type, which would strengthen the empirical validation.
1. Theoretical justification on how each normalization position affects the Jacobian and stabilizes gradient magnitudes.
2. Empirical results showing that DNT trained with mSGDW performs comparably to standard Transformers trained with AdamW, both on ImageNet (ViT) and OpenWebText (GPT2).
1. Theoretical assumptions are idealized: The high-dimensional isotropy and orthogonality assumptions may not hold exactly for real Transformer activations.
2. Similar ideas appear in nGPT, StableTransformer, and Lipsformer (which are cited), but the novelty claim is modest; it's mostly a systematic integration and justification rather than a new normalization method.
3. The comparison is primarily between mSGD and AdamW. It would be more compelling to see how it performs against other optimizer
1. The paper addresses a well-known problem in training Transformers, namely why momentum SGD (mSGD) tends to fail compared to adaptive optimizers like AdamW. It provides a detailed theoretical analysis showing how the placement of normalization layers affects the conditioning of Jacobian matrices and the variance of gradients, explaining why these adjustments are crucial for stable training.
2. The approach is practical, leveraging existing normalization techniques in new positions without in
1. The paper has several limitations regarding the scope and scale of its experiments. It evaluates the proposed Deeply Normalized Transformer (DNT) only on two benchmarks, ImageNet and OpenWebText, and does not include large-scale or multimodal tasks. This narrow evaluation makes it difficult to assess how well the method generalizes to other domains or to the training of state-of-the-art large models.
2. Another limitation is the increased complexity introduced by multiple normalization place
Taxonomy
Topics: Neural Networks and Applications
