Transformers Don't Need LayerNorm at Inference Time: Scaling LayerNorm Removal to GPT-2 XL and the Implications for Mechanistic Interpretability

Luca Baroni; Galvin Khara; Joachim Schaeffer; Marat Subkhankulov; Stefan Heimersheim

arXiv:2507.02559·cs.LG·July 4, 2025




TL;DR

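This paper demonstrates that all layer normalization (LN) layers can be removed from GPT-2 models with only a small increase in validation loss. Removing LN simplifies mechanistic interpretability, and the fine-tuning data required grows sublinearly with model size, which suggests the approach could scale to larger models.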

Contribution

It shows that LN layers are not essential at inference time for GPT-2 and releases a suite of LN-free GPT-2 models on Hugging Face, giving mechanistic interpretability research models without the additional nonlinearities that LN introduces.

Findings

01 LN removal causes only a small increase in validation loss (+0.03 cross-entropy for GPT-2 XL).

02 LN-free models maintain comparable performance and interpretability features.

03 The amount of fine-tuning data needed for LN removal grows sublinearly with the number of model parameters.
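To put the +0.03 figure from the first finding in more familiar terms, the short sketch below converts a cross-entropy gap into a relative change in perplexity. It assumes the loss is reported in nats per token (the usual convention, but not stated on this page); the function name `perplexity_ratio` is purely illustrative.

```python
import math


def perplexity_ratio(delta_ce_nats: float) -> float:
    """Ratio of perplexities implied by a cross-entropy gap.

    Perplexity is exp(cross-entropy), so a gap of `delta_ce_nats`
    multiplies perplexity by exp(delta_ce_nats).
    Assumption: the gap is measured in nats per token.
    """
    return math.exp(delta_ce_nats)


# The +0.03 gap reported for GPT-2 XL.
ratio = perplexity_ratio(0.03)
print(f"{(ratio - 1) * 100:.2f}% higher perplexity")  # prints "3.05% higher perplexity"
```

Under that assumption, the reported gap corresponds to roughly a 3% increase in perplexity, which is consistent with the description of the loss increase as small.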

Abstract

Layer-wise normalization (LN) is an essential component of virtually all transformer-based large language models. While its effects on training stability are well documented, its role at inference time is poorly understood. Additionally, LN layers hinder mechanistic interpretability by introducing additional nonlinearities and increasing the interconnectedness of individual model components. Here, we show that all LN layers can be removed from every GPT-2 model with only a small increase in validation loss (e.g. +0.03 cross-entropy loss for GPT-2 XL). Thus, LN cannot play a substantial role in language modeling. We find that the amount of fine-tuning data needed for LN removal grows sublinearly with model parameters, suggesting scaling to larger models is feasible. We release a suite of LN-free GPT-2 models on Hugging Face. Furthermore, we test interpretability techniques on LN-free…
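The claim that LN introduces additional nonlinearities can be made concrete with a small sketch. The code below is only an illustration in plain Python, not the paper's removal procedure (which this excerpt does not describe): it omits the learned gain and bias, and `fixed_scale_norm` is a hypothetical stand-in showing that the same operation becomes linear once the divisor is a constant rather than a function of the input.

```python
import math


def layer_norm(x, eps=1e-5):
    """Standard LN without learned gain/bias: center, then divide by the input's own std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    std = math.sqrt(var + eps)
    return [(v - mean) / std for v in x]


def fixed_scale_norm(x, fixed_std):
    """Hypothetical variant: center, then divide by a constant instead of the input's std."""
    mean = sum(x) / len(x)
    return [(v - mean) / fixed_std for v in x]


def add(a, b):
    return [p + q for p, q in zip(a, b)]


def max_abs_diff(a, b):
    return max(abs(p - q) for p, q in zip(a, b))


# Two arbitrary vectors standing in for contributions from two model components.
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, -10.0, 5.0, 0.0]

# Linearity check: does f(a + b) equal f(a) + f(b)?
ln_gap = max_abs_diff(layer_norm(add(a, b)),
                      add(layer_norm(a), layer_norm(b)))
fixed_gap = max_abs_diff(fixed_scale_norm(add(a, b), 2.0),
                         add(fixed_scale_norm(a, 2.0), fixed_scale_norm(b, 2.0)))

print("LN additivity gap:", ln_gap)            # clearly non-zero
print("fixed-scale additivity gap:", fixed_gap)  # zero up to floating-point error
```

Because the divisor in `layer_norm` depends on the whole input, the output for one component's contribution changes whenever any other component's contribution changes. This is one way to read the abstract's point about increased interconnectedness between model components.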
