Peri-LN: Revisiting Normalization Layer in the Transformer Architecture

Jeonghoon Kim; Byeongchan Lee; Cheonbok Park; Yeontaek Oh; Beomjun Kim; Taehwan Yoo; Seongjin Shin; Dongyoon Han; Jinwoo Shin; Kang Min Yoo

arXiv:2502.02732 · cs.LG · June 9, 2025

TL;DR

This paper analyzes how the placement of layer normalization affects training stability and convergence in large-scale Transformers. It names and examines Peri-LN, a placement that applies normalization around each sublayer, and reports that it improves variance control and gradient flow.

Contribution

It provides a comprehensive theoretical analysis of LN placement strategies and characterizes Peri-LN, a placement that several open-source models have already adopted without much explanation, showing that it enhances training stability in large Transformers.

Findings

01. Peri-LN achieves more balanced activation variance.
02. Peri-LN results in steadier gradient propagation.
03. Peri-LN improves convergence stability in large models.

Abstract

Selecting a layer normalization (LN) strategy that stabilizes training and speeds convergence in Transformers remains difficult, even for today's large language models (LLMs). We present a comprehensive analytical foundation for understanding how different LN strategies influence training dynamics in large-scale Transformers. Until recently, Pre-LN and Post-LN dominated practice despite their limitations in large-scale training. However, several open-source models have recently begun silently adopting a third strategy without much explanation. This strategy places normalization layers peripherally around sublayers, a design we term Peri-LN. While Peri-LN has demonstrated promising performance, its precise mechanisms and benefits remain almost unexplored. Our in-depth analysis delineates the distinct behaviors of LN strategies, showing how each placement shapes activation…
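
The three placements named in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the commonly used formulations Post-LN `y = LN(x + F(x))`, Pre-LN `y = x + F(LN(x))`, and Peri-LN `y = x + LN(F(LN(x)))`, uses an LN without learnable scale or shift, and replaces attention/MLP with a hypothetical `toy_sublayer`. All function names are illustrative assumptions.

```python
import math
from typing import Callable, List

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def layer_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """LayerNorm without learnable scale/shift: zero mean, (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def variance(x: Vector) -> float:
    """Population variance over the feature dimension."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)


def post_ln_block(x: Vector, sublayer: Sublayer) -> Vector:
    # Post-LN (assumed form): normalize after the residual addition.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln_block(x: Vector, sublayer: Sublayer) -> Vector:
    # Pre-LN (assumed form): normalize only the sublayer input.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def peri_ln_block(x: Vector, sublayer: Sublayer) -> Vector:
    # Peri-LN (assumed form): normalize both the sublayer input and its output.
    update = layer_norm(sublayer(layer_norm(x)))
    return [a + b for a, b in zip(x, update)]


def residual_update(x: Vector, y: Vector) -> Vector:
    """What the block added to the residual stream (y - x)."""
    return [b - a for a, b in zip(x, y)]


def toy_sublayer(h: Vector, gain: float = 10.0) -> Vector:
    # Hypothetical stand-in for attention/MLP with a large output scale.
    return [gain * v ** 3 for v in h]


x = [0.5, -1.0, 2.0, 0.1]
pre_update = residual_update(x, pre_ln_block(x, toy_sublayer))
peri_update = residual_update(x, peri_ln_block(x, toy_sublayer))
print("Pre-LN update variance: ", variance(pre_update))
print("Peri-LN update variance:", variance(peri_update))
```

Under these assumptions, the quantity a Pre-LN block adds to the residual stream scales with the sublayer's output magnitude, whereas the output normalization in the Peri-LN block keeps that quantity near unit variance regardless of `gain`. This toy only illustrates the structural difference between the placements; it does not reproduce the paper's experiments.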

Videos

Peri-LN: Revisiting Normalization Layer in the Transformer Architecture · SlidesLive

Taxonomy

Topics: Magnetic Properties and Applications

Methods: Attention Is All You Need · Label Smoothing · Byte Pair Encoding · Residual Connection · Dense Connections · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Softmax