Impact of Layer Norm on Memorization and Generalization in Transformers

Rishi Singhal; Jung-Eun Kim

arXiv:2511.10566·cs.LG·November 14, 2025

TL;DR

This paper investigates how Layer Normalization (LayerNorm) affects memorization and learning in Pre- and Post-LayerNorm transformers. It finds that LayerNorm is key to stable learning in Pre-LayerNorm models, while in Post-LayerNorm models it mainly affects memorization.

Contribution

It provides a detailed analysis of how LayerNorm affects memorization and learning in Pre- and Post-LayerNorm transformers, highlighting the importance of LayerNorm in the early layers.

Findings

01 LayerNorm stabilizes learning in Pre-LayerNorm transformers.

02 Removing LayerNorm increases memorization in Pre-LayerNorm models.

03 In Post-LayerNorm models, removing LayerNorm reduces memorization.

Abstract

Layer Normalization (LayerNorm) is one of the fundamental components in transformers that stabilizes training and improves optimization. In recent times, Pre-LayerNorm transformers have become the preferred choice over Post-LayerNorm transformers due to their stable gradient flow. However, the impact of LayerNorm on learning and memorization across these architectures remains unclear. In this work, we investigate how LayerNorm influences memorization and learning for Pre- and Post-LayerNorm transformers. We identify that LayerNorm serves as a key factor for stable learning in Pre-LayerNorm transformers, while in Post-LayerNorm transformers, it impacts memorization. Our analysis reveals that eliminating LayerNorm parameters in Pre-LayerNorm models exacerbates memorization and destabilizes learning, while in Post-LayerNorm models, it effectively mitigates memorization by restoring genuine…
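
To make the two architectures named in the abstract concrete, here is a minimal NumPy sketch of a Pre-LayerNorm and a Post-LayerNorm residual block. It is not the authors' code. The `sublayer` function is a hypothetical stand-in for an attention or MLP sublayer, and reading "eliminating LayerNorm parameters" as dropping the learnable scale (`gamma`) and shift (`beta`) is an assumption made here for illustration.

```python
import numpy as np

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize over the last axis; gamma/beta are optional learnable scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    y = (x - mean) / np.sqrt(var + eps)
    if gamma is not None:
        y = y * gamma
    if beta is not None:
        y = y + beta
    return y

def pre_ln_block(x, sublayer, gamma=None, beta=None):
    # Pre-LayerNorm: normalize the sublayer input; the residual path is left un-normalized.
    return x + sublayer(layer_norm(x, gamma, beta))

def post_ln_block(x, sublayer, gamma=None, beta=None):
    # Post-LayerNorm: normalize after the residual has been added.
    return layer_norm(x + sublayer(x), gamma, beta)

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d)) / np.sqrt(d)

def sublayer(h):
    # Hypothetical stand-in for an attention or MLP sublayer.
    return np.tanh(h @ W)

x = rng.normal(size=(4, d))
gamma = np.full(d, 2.0)  # illustrative values, not trained parameters
beta = np.full(d, 0.5)

pre_out = pre_ln_block(x, sublayer, gamma, beta)
post_out = post_ln_block(x, sublayer, gamma, beta)
# Assumed reading of "eliminating LayerNorm parameters": no scale, no shift.
post_out_no_params = post_ln_block(x, sublayer)
```

In the Pre-LayerNorm block the input `x` reaches the output through an identity path, whereas in the Post-LayerNorm block every output passes through the normalization, so removing `gamma` and `beta` changes the two blocks in structurally different ways.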

Videos

Impact of Layer Norm on Memorization and Generalization in Transformers · SlidesLive

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Advanced Memory and Neural Computing · Advanced Neural Network Applications