Peri-LN: Revisiting Normalization Layer in the Transformer Architecture
Jeonghoon Kim, Byeongchan Lee, Cheonbok Park, Yeontaek Oh, Beomjun Kim, Taehwan Yoo, Seongjin Shin, Dongyoon Han, Jinwoo Shin, Kang Min Yoo

TL;DR
This paper analyzes how different layer normalization placements affect training stability and convergence in large-scale Transformers, and it names and validates Peri-LN, a placement that improves variance control and gradient flow.
Contribution
It provides a comprehensive theoretical analysis of LN placement strategies and formalizes Peri-LN, a normalization placement already adopted by several open-source models, showing that it enhances training stability in large Transformers.
Findings
Peri-LN achieves more balanced activation variance.
Peri-LN results in steadier gradient propagation.
Peri-LN improves convergence stability in large models.
Abstract
Selecting a layer normalization (LN) strategy that stabilizes training and speeds convergence in Transformers remains difficult, even for today's large language models (LLMs). We present a comprehensive analytical foundation for understanding how different LN strategies influence training dynamics in large-scale Transformers. Pre-LN and Post-LN have long dominated practice despite their limitations in large-scale training. However, several open-source models have recently begun silently adopting a third strategy without much explanation. This strategy places normalization layers peripherally around sublayers, a design we term Peri-LN. While Peri-LN has demonstrated promising performance, its precise mechanisms and benefits remain almost unexplored. Our in-depth analysis delineates the distinct behaviors of LN strategies, showing how each placement shapes activation…
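The three placements named in the abstract can be illustrated with a minimal sketch. It assumes the usual formulations: Post-LN normalizes after the residual addition, Pre-LN normalizes the sublayer input, and Peri-LN normalizes both the input and the output of the sublayer before the residual addition. The helper names (layer_norm, toy_sublayer, variance) are hypothetical and not taken from the paper, the normalization has no learned scale or shift, and the sublayer is a stand-in with a deliberately large output scale rather than real attention or an MLP.

```python
import math
from typing import Callable, List

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize to zero mean and unit variance (no learned parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise residual addition."""
    return [p + q for p, q in zip(a, b)]


def post_ln(x: Vector, f: Sublayer) -> Vector:
    # Post-LN: normalize after the residual addition.
    return layer_norm(add(x, f(x)))


def pre_ln(x: Vector, f: Sublayer) -> Vector:
    # Pre-LN: normalize only the sublayer input; the output is added raw.
    return add(x, f(layer_norm(x)))


def peri_ln(x: Vector, f: Sublayer) -> Vector:
    # Peri-LN (assumed form): normalize the sublayer input and its output.
    return add(x, layer_norm(f(layer_norm(x))))


def toy_sublayer(h: Vector) -> Vector:
    # Hypothetical stand-in for attention/MLP with a large output scale.
    return [10.0 * v for v in h]


def variance(x: Vector) -> float:
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    for name, block in [("post", post_ln), ("pre", pre_ln), ("peri", peri_ln)]:
        print(name, variance(block(x, toy_sublayer)))
```

In this toy setting the Pre-LN block passes the sublayer's large output straight into the residual stream, while the Peri-LN block bounds it with the output normalization. This is only meant to make the placement concrete, not to reproduce the paper's analysis.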
Taxonomy
Topics: Magnetic Properties and Applications
Methods: Attention Is All You Need · Label Smoothing · Byte Pair Encoding · Residual Connection · Dense Connections · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Softmax
