Impact of Layer Norm on Memorization and Generalization in Transformers
Rishi Singhal, Jung-Eun Kim

TL;DR
This paper investigates how Layer Normalization (LayerNorm) affects memorization and learning in Pre- and Post-LayerNorm transformers. It finds that LayerNorm is key to stable learning in Pre-LayerNorm models, while in Post-LayerNorm models it mainly influences memorization.
Contribution
The paper analyzes how LayerNorm affects memorization and learning in Pre- and Post-LayerNorm transformers, and highlights the importance of LayerNorm in the early layers.
Findings
LayerNorm is a key factor for stable learning in Pre-LayerNorm transformers.
Eliminating LayerNorm parameters in Pre-LayerNorm models exacerbates memorization and destabilizes learning.
Eliminating LayerNorm parameters in Post-LayerNorm models mitigates memorization.
Abstract
Layer Normalization (LayerNorm) is one of the fundamental components in transformers that stabilizes training and improves optimization. In recent times, Pre-LayerNorm transformers have become the preferred choice over Post-LayerNorm transformers due to their stable gradient flow. However, the impact of LayerNorm on learning and memorization across these architectures remains unclear. In this work, we investigate how LayerNorm influences memorization and learning for Pre- and Post-LayerNorm transformers. We identify that LayerNorm serves as a key factor for stable learning in Pre-LayerNorm transformers, while in Post-LayerNorm transformers, it impacts memorization. Our analysis reveals that eliminating LayerNorm parameters in Pre-LayerNorm models exacerbates memorization and destabilizes learning, while in Post-LayerNorm models, it effectively mitigates memorization by restoring genuine…
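For readers unfamiliar with the two layouts the abstract contrasts, the following is a minimal sketch in plain Python, not the authors' code. It assumes the standard formulations: a Pre-LayerNorm block computes x + f(LN(x)), and a Post-LayerNorm block computes LN(x + f(x)). The function `toy_sublayer` is a hypothetical stand-in for attention or an MLP. Passing no `gamma`/`beta` is one possible reading of "eliminating LayerNorm parameters" (dropping the learnable scale and shift); the paper's exact procedure may differ.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize a vector to zero mean and unit variance, then optionally scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    if gamma is None or beta is None:
        # Assumption: "no LayerNorm parameters" means no learnable scale/shift.
        return normed
    return [g * n + b for g, n, b in zip(gamma, normed, beta)]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or an MLP (elementwise nonlinearity)."""
    return [math.tanh(2.0 * v) for v in x]

def pre_ln_block(x, gamma=None, beta=None):
    """Pre-LayerNorm: normalize the sublayer input, keep the residual path untouched."""
    h = toy_sublayer(layer_norm(x, gamma, beta))
    return [a + b for a, b in zip(x, h)]

def post_ln_block(x, gamma=None, beta=None):
    """Post-LayerNorm: normalize the sum of the residual and the sublayer output."""
    h = toy_sublayer(x)
    return layer_norm([a + b for a, b in zip(x, h)], gamma, beta)

# Illustrative values only; these are not taken from the paper.
x = [0.5, -1.0, 2.0, 0.1]
gamma = [1.5, 1.5, 1.5, 1.5]
beta = [0.1, 0.1, 0.1, 0.1]

pre = pre_ln_block(x, gamma, beta)
post = post_ln_block(x, gamma, beta)
post_no_params = post_ln_block(x)  # LayerNorm without learnable scale/shift
```

The structural difference is where normalization sits relative to the residual connection: in the Pre-LayerNorm block the input `x` passes through the skip path unnormalized, whereas in the Post-LayerNorm block the entire residual stream is renormalized at every block.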
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Advanced Memory and Neural Computing · Advanced Neural Network Applications
