The Mysterious Case of Neuron 1512: Injectable Realignment Architectures Reveal Internal Characteristics of Meta's Llama 2 Model

Brenden Smith; Dallin Baker; Clayton Chase; Myles Barney; Kaden Parker; Makenna Allred; Peter Hu; Alex Evans; Nancy Fulda

arXiv:2407.03621 · cs.CL · July 8, 2024

TL;DR

This paper introduces the Injectable Realignment Model (IRM), a novel interpretability approach for LLMs that reveals internal neuron activation patterns related to alignment behaviors, uncovering a key neuron linked to model alignment characteristics.

Contribution

The paper presents IRM as a new method for analyzing LLM internal mechanisms by injecting alignment signals, revealing neuron-specific patterns linked to model behavior.

Findings

01. Neuron 1512 strongly correlates with alignment behaviors.

02. IRM activations form patterns associated with neuron indices, not layers.

03. Design choices in transformer architectures influence neuron activation patterns.
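
The distinction between patterns tied to neuron indices and patterns tied to layers can be illustrated with a small sketch. It is not the authors' analysis code: the function names and the toy matrix are hypothetical, and the numbers are invented purely for illustration. The idea is to average the magnitude of the injected values along each axis and compare how much the averages vary.

```python
from statistics import mean


def mean_abs_by_neuron(irm_outputs):
    """Average |value| for each neuron index, taken across all layers.

    irm_outputs[layer][neuron] is a hypothetical layout, not the paper's.
    """
    n_neurons = len(irm_outputs[0])
    return [mean(abs(layer[j]) for layer in irm_outputs) for j in range(n_neurons)]


def mean_abs_by_layer(irm_outputs):
    """Average |value| for each layer, taken across all neuron indices."""
    return [mean(abs(v) for v in layer) for layer in irm_outputs]


def spread(values):
    """Range of a list of averages: large spread means that axis matters."""
    return max(values) - min(values)


# Invented toy data: 4 layers x 6 neurons, where index 2 is large in every layer.
toy = [
    [0.1, -0.2, 3.0, 0.1, 0.0, -0.1],
    [0.0, 0.1, -2.8, 0.2, -0.1, 0.1],
    [-0.1, 0.2, 3.1, 0.0, 0.1, 0.0],
    [0.2, -0.1, -2.9, 0.1, 0.0, 0.1],
]

per_neuron = mean_abs_by_neuron(toy)
per_layer = mean_abs_by_layer(toy)

# Index of the neuron with the largest average magnitude across layers.
top_neuron = max(range(len(per_neuron)), key=per_neuron.__getitem__)

print("top neuron index:", top_neuron)
print("spread across neuron indices:", spread(per_neuron))
print("spread across layers:", spread(per_layer))
```

In this toy matrix the averages differ far more across neuron indices than across layers, which is the shape of result that finding 02 describes. A real analysis would run over the trained IRM's outputs rather than hand-written numbers.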

Abstract

Large Language Models (LLMs) have an unrivaled and invaluable ability to "align" their output to a diverse range of human preferences, by mirroring them in the text they generate. The internal characteristics of such models, however, remain largely opaque. This work presents the Injectable Realignment Model (IRM) as a novel approach to language model interpretability and explainability. Inspired by earlier work on Neural Programming Interfaces, we construct and train a small network -- the IRM -- to induce emotion-based alignments within a 7B parameter LLM architecture. The IRM outputs are injected via layerwise addition at various points during the LLM's forward pass, thus modulating its behavior without changing the weights of the original model. This isolates the alignment behavior from the complex mechanisms of the transformer model. Analysis of the trained IRM's outputs reveals a…
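
The injection mechanism described in the abstract, adding a small network's outputs to the hidden state during the forward pass while leaving the original weights untouched, can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: `frozen_layer`, `forward`, the toy weights, and the fixed `injections` vectors (standing in for the outputs of a trained IRM) are all hypothetical.

```python
import copy
import math


def frozen_layer(hidden, weights):
    """One toy layer: a fixed linear map followed by tanh. Weights are only read."""
    return [math.tanh(sum(w * h for w, h in zip(row, hidden))) for row in weights]


def forward(hidden, layers, injections=None):
    """Run the frozen stack, optionally adding an injected vector after each layer."""
    for i, weights in enumerate(layers):
        hidden = frozen_layer(hidden, weights)
        if injections is not None:
            # Layerwise addition: the injected vector is summed into the hidden state.
            hidden = [h + d for h, d in zip(hidden, injections[i])]
    return hidden


# Hypothetical frozen "model": two layers acting on a 3-dimensional hidden state.
layers = [
    [[0.5, 0.0, 0.1], [0.0, 0.4, 0.0], [0.2, 0.0, 0.3]],
    [[0.3, 0.1, 0.0], [0.0, 0.6, 0.1], [0.0, 0.0, 0.5]],
]
layers_before = copy.deepcopy(layers)

x = [1.0, -1.0, 0.5]

# Placeholder injections; in the paper these would come from the trained IRM.
zero_injections = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
injections = [[0.0, 0.0, 0.8], [0.0, 0.0, 0.8]]

baseline = forward(x, layers)
unchanged = forward(x, layers, zero_injections)
steered = forward(x, layers, injections)

print("baseline:", baseline)
print("steered: ", steered)
```

With all-zero injections the output matches the baseline, and with nonzero injections it shifts, even though `layers` is never modified. That separation is what lets the injected signal be studied on its own.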

Code & Models

Repositories

dragnlabs/injectable-alignment-model
PyTorch · Official

Taxonomy

Topics: Cell Image Analysis Techniques · Neurological disorders and treatments · Axon Guidance and Neuronal Signaling

Methods: ALIGN · LLaMA