Disentangling MLP Neuron Weights in Vocabulary Space
Asaf Avrahamy, Yoav Gur-Arieh, Mor Geva

TL;DR
ROTATE is a data-free method that interprets MLP neurons by optimizing rotations to maximize kurtosis in vocabulary space, revealing interpretable directions called vocabulary channels.
Contribution
The paper introduces a data-free approach that disentangles neuron weights in language models by maximizing vocabulary-space kurtosis, enabling scalable and fine-grained interpretability.
Findings
ROTATE recovers vocabulary channels faithful to neuron behavior.
Ablating channels disables specific neuron functions.
Channel descriptions outperform activation-based baselines by 2-3x.
Abstract
Interpreting the information encoded in model weights remains a fundamental challenge in mechanistic interpretability. In this work, we introduce ROTATE (Rotation-Optimized Token Alignment in weighT spacE), a data-free method requiring no forward passes that disentangles MLP neurons directly in weight space. Our approach relies on a key statistical observation: neurons that encode coherent, monosemantic concepts exhibit high kurtosis when projected onto the model's vocabulary. By optimizing rotations of neuron weights to maximize their vocabulary-space kurtosis, our method recovers sparse, interpretable directions which we name vocabulary channels. Experiments on Llama-3.1-8B-Instruct and Gemma-2-2B-it demonstrate that ROTATE consistently recovers vocabulary channels that are faithful to the neuron's behavior. Ablating individual channels selectively disables corresponding input…
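The kurtosis observation in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the unembedding matrix is synthetic, the names `vocab_kurtosis`, `unembed`, `mixed`, and `orth` are hypothetical, and the rotation is a brute-force sweep over a single 2-D plane rather than an optimized rotation. It only shows why a direction that mixes two token groups scores lower kurtosis in vocabulary space than a direction aligned with one group, and that rotating toward higher kurtosis recovers the single-group direction.

```python
import numpy as np

def vocab_kurtosis(w, unembed):
    """Excess kurtosis of the vocabulary-space projection unembed @ w."""
    logits = unembed @ w
    z = logits - logits.mean()
    return (z ** 4).mean() / (z ** 2).mean() ** 2 - 3.0

rng = np.random.default_rng(0)
vocab_size, d_model = 5000, 64

# Synthetic unembedding (an assumption, not real model weights):
# Gaussian noise plus two small token groups that load strongly
# on coordinate 0 and coordinate 1 respectively.
unembed = rng.standard_normal((vocab_size, d_model))
unembed[:20, 0] += 30.0
unembed[20:40, 1] += 30.0

e0 = np.zeros(d_model)
e0[0] = 1.0
e1 = np.zeros(d_model)
e1[1] = 1.0

# A toy "polysemantic" weight vector mixing both token groups,
# and the orthogonal direction spanning the same 2-D plane.
mixed = (e0 + e1) / np.sqrt(2)
orth = (e0 - e1) / np.sqrt(2)

k_channel = vocab_kurtosis(e0, unembed)   # aligned with one token group
k_mixed = vocab_kurtosis(mixed, unembed)  # spread over both groups

# Brute-force stand-in for rotation optimization: rotate `mixed`
# within the plane in 1-degree steps and keep the highest-kurtosis angle.
angles = np.deg2rad(np.arange(-90, 91))
scores = [
    vocab_kurtosis(np.cos(t) * mixed + np.sin(t) * orth, unembed)
    for t in angles
]
best_angle = angles[int(np.argmax(scores))]

print("kurtosis, single group:", k_channel)
print("kurtosis, mixed:", k_mixed)
print("best rotation angle (degrees):", np.rad2deg(best_angle))
```

In this toy setup a rotation of +45 or -45 degrees maps `mixed` onto `e0` or `e1`, so the sweep should peak near one of those angles. A real setting would involve the model's actual unembedding matrix and a gradient-based search over higher-dimensional rotations, neither of which is specified in the text above.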
