TL;DR
This paper shows that the layer normalization (LN) layers can be removed from GPT-2 models with minimal impact on performance. Removing them simplifies interpretability, and the modest fine-tuning data requirements suggest the approach can scale to larger models.
Contribution
The paper shows that LN layers are not essential at inference time for GPT-2. This makes it possible to build LN-free models, which removes one obstacle to mechanistic interpretability research.
Findings
- Removing LN causes only a small increase in validation loss (e.g. +0.03 cross-entropy for GPT-2 XL).
- LN-free models maintain comparable performance and interpretability features.
- The amount of fine-tuning data needed for LN removal grows sublinearly with the number of model parameters.
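To put the +0.03 figure in context, a cross-entropy gap can be converted into a perplexity ratio. The sketch below assumes the loss is reported in nats per token (natural logarithm), which the summary does not state; `perplexity_ratio` is a hypothetical helper name used only for illustration.

```python
import math


def perplexity_ratio(delta_loss_nats: float) -> float:
    """Factor by which perplexity changes when the per-token
    cross-entropy changes by `delta_loss_nats`.

    Assumes the loss is measured in nats (natural log), so that
    perplexity = exp(loss) and the ratio is exp(delta).
    """
    return math.exp(delta_loss_nats)


# Loss gap quoted for GPT-2 XL in the findings above.
ratio = perplexity_ratio(0.03)
print(f"perplexity ratio: {ratio:.4f}")  # prints "perplexity ratio: 1.0305"
```

Under that assumption, a +0.03 gap corresponds to roughly a 3% increase in perplexity.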
Abstract
Layer-wise normalization (LN) is an essential component of virtually all transformer-based large language models. While its effects on training stability are well documented, its role at inference time is poorly understood. Additionally, LN layers hinder mechanistic interpretability by introducing additional nonlinearities and increasing the interconnectedness of individual model components. Here, we show that all LN layers can be removed from every GPT-2 model with only a small increase in validation loss (e.g. +0.03 cross-entropy loss for GPT-2 XL). Thus, LN cannot play a substantial role in language modeling. We find that the amount of fine-tuning data needed for LN removal grows sublinearly with model parameters, suggesting scaling to larger models is feasible. We release a suite of LN-free GPT-2 models on Hugging Face. Furthermore, we test interpretability techniques on LN-free…
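The claim that LN introduces an additional nonlinearity can be checked on a toy example. The following is a minimal sketch in plain Python, not the paper's code: `layer_norm` is a simplified version without the learned gain and bias, and the vectors `a` and `b` are arbitrary illustrative inputs. It shows that LN does not distribute over a sum of inputs and that it discards the input's scale, which is why the contributions of individual components cannot be read off additively after an LN layer.

```python
import math


def layer_norm(x, eps=1e-5):
    """Simplified layer normalization over a single vector.

    Omits the learned gain and bias (an assumption made to keep the
    sketch short); it only centers the vector and divides by its
    standard deviation.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


# Two arbitrary "component outputs" that are summed in a residual stream.
a = [1.0, 2.0, 3.0, 4.0]
b = [4.0, 0.0, -2.0, 1.0]

# A linear map f would satisfy f(a + b) == f(a) + f(b). LN does not.
ln_of_sum = layer_norm([u + v for u, v in zip(a, b)])
sum_of_ln = [u + v for u, v in zip(layer_norm(a), layer_norm(b))]
additivity_gap = max(abs(p - q) for p, q in zip(ln_of_sum, sum_of_ln))

# A linear map would also satisfy f(2a) == 2 f(a). LN instead returns
# (almost) the same output for a and 2a, since it divides out the scale.
scale_gap = max(
    abs(p - q)
    for p, q in zip(layer_norm([2 * v for v in a]), layer_norm(a))
)

print(f"additivity gap: {additivity_gap:.3f}")
print(f"LN(2a) vs LN(a) gap: {scale_gap:.2e}")
```

The additivity gap is far from zero, while the gap between `LN(2a)` and `LN(a)` is negligible. Both are violations of linearity.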
Code & Models
- 🤗 schaeff/gpt2-small_vanilla300 · model · 5 dl
- 🤗 schaeff/gpt2-medium_vanilla500 · model · 5 dl
- 🤗 schaeff/gpt2-xl_LNFree800 · model · 2 dl
- 🤗 schaeff/gpt2-large_LNFree600 · model · 6 dl
- 🤗 schaeff/gpt2-xl_vanilla800 · model · 3 dl
- 🤗 schaeff/gpt2-large_vanilla600 · model · 6 dl
- 🤗 schaeff/gpt2-small_LNFree300 · model · 73 dl · ♡ 1
- 🤗 schaeff/gpt2-medium_LNFree500 · model · 6 dl
