The Mysterious Case of Neuron 1512: Injectable Realignment Architectures Reveal Internal Characteristics of Meta's Llama 2 Model
Brenden Smith, Dallin Baker, Clayton Chase, Myles Barney, Kaden Parker, Makenna Allred, Peter Hu, Alex Evans, Nancy Fulda

TL;DR
This paper introduces the Injectable Realignment Model (IRM), a novel interpretability approach for LLMs that reveals internal neuron activation patterns related to alignment behaviors, uncovering a key neuron linked to model alignment characteristics.
Contribution
The paper presents IRM as a new method for analyzing LLM internal mechanisms by injecting alignment signals, revealing neuron-specific patterns linked to model behavior.
Findings
Neuron 1512 strongly correlates with alignment behaviors.
IRM activations form patterns associated with neuron indices, not layers.
Design choices in transformer architectures influence neuron activation patterns.
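One way to picture the second finding is to average the magnitude of the IRM outputs along each axis and compare the spread. The sketch below is illustrative only: the `toy_outputs` values are synthetic, the helper names are hypothetical, and the paper's actual analysis procedure is not described in this summary.

```python
def mean_abs_by_neuron(outputs):
    """outputs[layer][neuron] -> mean |value| per neuron index, averaged over layers."""
    num_layers = len(outputs)
    num_neurons = len(outputs[0])
    return [sum(abs(row[j]) for row in outputs) / num_layers for j in range(num_neurons)]


def mean_abs_by_layer(outputs):
    """outputs[layer][neuron] -> mean |value| per layer, averaged over neurons."""
    return [sum(abs(v) for v in row) / len(row) for row in outputs]


# Synthetic stand-in for IRM outputs: 3 layers x 5 neurons (NOT data from the paper).
# Column 3 is large in every layer, mimicking a pattern tied to a neuron index.
toy_outputs = [
    [0.1, -0.2, 0.1, 3.0, 0.0],
    [0.0, 0.1, -0.1, 2.8, 0.2],
    [-0.1, 0.2, 0.0, 3.1, 0.1],
]

by_neuron = mean_abs_by_neuron(toy_outputs)
by_layer = mean_abs_by_layer(toy_outputs)

# Index of the neuron with the largest average magnitude across layers.
dominant = max(range(len(by_neuron)), key=by_neuron.__getitem__)

# A pattern tied to neuron index shows up as high variation across neurons
# and low variation across layers.
neuron_spread = max(by_neuron) - min(by_neuron)
layer_spread = max(by_layer) - min(by_layer)
```

In this toy data the per-neuron spread is much larger than the per-layer spread, which is the shape of result the finding describes: the same index stands out regardless of layer.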
Abstract
Large Language Models (LLMs) have an unrivaled and invaluable ability to "align" their output to a diverse range of human preferences, by mirroring them in the text they generate. The internal characteristics of such models, however, remain largely opaque. This work presents the Injectable Realignment Model (IRM) as a novel approach to language model interpretability and explainability. Inspired by earlier work on Neural Programming Interfaces, we construct and train a small network -- the IRM -- to induce emotion-based alignments within a 7B parameter LLM architecture. The IRM outputs are injected via layerwise addition at various points during the LLM's forward pass, thus modulating its behavior without changing the weights of the original model. This isolates the alignment behavior from the complex mechanisms of the transformer model. Analysis of the trained IRM's outputs reveals a…
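The injection mechanism described in the abstract can be sketched in a few lines. This is a minimal toy, not the authors' implementation: `toy_layer` stands in for a frozen transformer block, the injection vectors are fixed by hand rather than produced by a trained IRM, and injecting after every layer is an assumption (the abstract only says "at various points").

```python
import copy
import random


def toy_layer(hidden, weight):
    """Stand-in for a frozen transformer block: a fixed linear map plus a residual."""
    return [h + sum(w * x for w, x in zip(row, hidden)) for h, row in zip(hidden, weight)]


def forward(hidden, weights, injections=None):
    """Run the toy model; optionally add one injection vector after each layer."""
    for i, weight in enumerate(weights):
        hidden = toy_layer(hidden, weight)
        if injections is not None:
            # Layerwise addition: the base model's weights are never modified.
            hidden = [h + delta for h, delta in zip(hidden, injections[i])]
    return hidden


random.seed(0)
HIDDEN_SIZE, NUM_LAYERS = 4, 3  # toy sizes, far smaller than a 7B model

frozen_weights = [
    [[random.uniform(-0.1, 0.1) for _ in range(HIDDEN_SIZE)] for _ in range(HIDDEN_SIZE)]
    for _ in range(NUM_LAYERS)
]
snapshot = copy.deepcopy(frozen_weights)

x = [1.0, 0.0, -1.0, 0.5]

# Hypothetical IRM outputs: a constant nudge on hidden index 2 at every layer.
injections = [[0.0, 0.0, 0.5, 0.0] for _ in range(NUM_LAYERS)]
zero_injections = [[0.0] * HIDDEN_SIZE for _ in range(NUM_LAYERS)]

baseline = forward(x, frozen_weights)
steered = forward(x, frozen_weights, injections)
unchanged = forward(x, frozen_weights, zero_injections)
```

Here `steered` differs from `baseline` even though `frozen_weights` is untouched, and an all-zero injection reproduces the baseline exactly. Because all behavioral change enters through the added vectors, those vectors can be inspected separately from the base model.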
