SpanNorm: Reconciling Training Stability and Performance in Deep Transformers

Chao Wang; Bei Li; Jiaqi Zhang; Xinyu Liu; Yuchun Fan; Linkun Lyu; Xin Chen; Jingang Wang; Tong Xiao; Peng Pei; Xunliang Cai

arXiv:2601.22580·cs.CL·February 2, 2026

TL;DR

SpanNorm is a normalization scheme for deep Transformers that pairs a residual connection spanning the entire block with PostNorm-style normalization of the aggregated output, aiming to combine the training stability of PreNorm with the performance of PostNorm.

Contribution

SpanNorm introduces a residual-based normalization method that unifies the benefits of PreNorm and PostNorm, supported by theoretical analysis and empirical validation.

Findings

01. SpanNorm stabilizes training in deep Transformers.

02. SpanNorm improves performance in dense and MoE models.

03. SpanNorm prevents gradient issues and representation collapse.

Abstract

The success of Large Language Models (LLMs) hinges on the stable training of deep Transformer architectures. A critical design choice is the placement of normalization layers, leading to a fundamental trade-off: the "PreNorm" architecture ensures training stability at the cost of potential performance degradation in deep models, while the "PostNorm" architecture offers strong performance but suffers from severe training instability. In this work, we propose SpanNorm, a novel technique designed to resolve this dilemma by integrating the strengths of both paradigms. Structurally, SpanNorm establishes a clean residual connection that spans the entire transformer block to stabilize signal propagation, while employing a PostNorm-style computation that normalizes the aggregated output to enhance model performance. We provide a theoretical analysis demonstrating that SpanNorm, combined…
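To make the placement trade-off concrete, the following is a minimal pure-Python sketch of the two standard layouts, plus one possible reading of the span-style structure described in the abstract. The PreNorm and PostNorm blocks follow their usual definitions. The `span_style_block` function is an assumption built only from the abstract's wording (a single residual across the whole block, with the summed output normalized once); the inner composition `ffn(attn(x))`, the helper names, and the toy sublayers are hypothetical and are not taken from the paper, whose exact formulation may differ.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [u + v for u, v in zip(a, b)]


def pre_norm_block(x, attn, ffn):
    # PreNorm: normalize each sublayer's input; the residual path is never normalized.
    h = add(x, attn(layer_norm(x)))
    return add(h, ffn(layer_norm(h)))


def post_norm_block(x, attn, ffn):
    # PostNorm: normalize after every residual addition.
    h = layer_norm(add(x, attn(x)))
    return layer_norm(add(h, ffn(h)))


def span_style_block(x, attn, ffn):
    # ASSUMPTION: one reading of the abstract, not the paper's definition.
    # A single residual spans the whole block, and the aggregated output
    # is normalized once, PostNorm-style. The inner composition is hypothetical.
    inner = ffn(attn(x))
    return layer_norm(add(x, inner))


# Toy stand-ins for the attention and feed-forward sublayers (hypothetical).
toy_attn = lambda v: [2.0 * t for t in v]
toy_ffn = lambda v: [t + 1.0 for t in v]

x = [1.0, 2.0, 3.0, 4.0]
pre_out = pre_norm_block(x, toy_attn, toy_ffn)
post_out = post_norm_block(x, toy_attn, toy_ffn)
span_out = span_style_block(x, toy_attn, toy_ffn)
```

In this sketch the PostNorm and span-style outputs are renormalized at the block boundary, while the PreNorm output keeps the unnormalized residual stream. That difference is the structural point the abstract contrasts; the sketch says nothing about the stability or performance claims themselves.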

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Graph Neural Networks · Advanced Neural Network Applications