You can remove GPT2's LayerNorm by fine-tuning

Stefan Heimersheim

arXiv:2409.13710 · cs.CL · November 19, 2024

TL;DR

This paper demonstrates that GPT2's LayerNorm layers can be removed through fine-tuning without significant performance loss, simplifying models for interpretability research.

Contribution

It shows that the LayerNorm layers of a pre-trained GPT2-small model can be removed by fine-tuning, so a model that was trained with LayerNorm does not need to keep it to reach similar performance.

Findings

01  The LN-free GPT2-small model achieves performance similar to the original model on key datasets.

02  Removing LN simplifies the model for mechanistic interpretability.

03  Fine-tuning requires only 500M tokens, a fraction of the training data.

Abstract

The LayerNorm (LN) layer in GPT-style transformer models has long been a hindrance to mechanistic interpretability. LN is a crucial component required to stabilize the training of large language models, and LN or the similar RMSNorm have been used in practically all large language models based on the transformer architecture. The non-linear nature of the LN layers is a hindrance for mechanistic interpretability as it hinders interpretation of the residual stream, and makes it difficult to decompose the model into circuits. Some researchers have gone so far as to name "reasons interpretability researchers hate layer norm." In this paper we show that it is possible to remove the LN layers from a pre-trained GPT2-small model by fine-tuning on a fraction (500M tokens) of the training data. We demonstrate that this LN-free model achieves similar performance to the original model on the…
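The non-linearity the abstract refers to can be seen in a few lines. The following is a minimal sketch, not code from the paper: it assumes a simplified LayerNorm without the learned gain and bias, and the helper name `layer_norm` is hypothetical. It checks that LayerNorm is not additive, which is the property that prevents a clean decomposition of the residual stream into independent contributions.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Simplified LayerNorm (assumption: no learned gain or bias).

    Centers the vector, then divides by its standard deviation.
    """
    centered = x - x.mean(axis=-1, keepdims=True)
    return centered / np.sqrt(centered.var(axis=-1, keepdims=True) + eps)


rng = np.random.default_rng(0)
a = rng.normal(size=8)  # stand-in for one contribution to the residual stream
b = rng.normal(size=8)  # stand-in for a second contribution

# A linear map f satisfies f(a + b) == f(a) + f(b). LayerNorm divides by a
# statistic of the whole input, so it does not satisfy this in general.
additive = np.allclose(layer_norm(a + b), layer_norm(a) + layer_norm(b))

# LayerNorm is also (up to eps) insensitive to rescaling its input,
# which a non-zero linear map cannot be.
scale_invariant = np.allclose(layer_norm(3.0 * a), layer_norm(a), atol=1e-4)

print("additive:", additive)
print("scale invariant:", scale_invariant)
```

Because the divisor depends on the full input vector, the effect of one component on the output cannot be read off without knowing all the others. Removing LN removes this coupling.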

Code & Models

Repositories

apolloresearch/gpt2_noln (PyTorch, official)

Models

apollo-research/gpt2_noLN (Hugging Face model · 154 downloads · 4 likes)

Taxonomy

Topics: Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques

Methods: Root Mean Square Layer Normalization