Understanding Gated Neurons in Transformers from Their Input-Output Functionality
Sebastian Gerstner, Hinrich Schütze

TL;DR
This paper investigates the input-output interactions of neurons in transformer models, revealing layer-specific roles of enrichment and depletion neurons in concept representation and factual recall.
Contribution
It introduces a method that analyzes the cosine similarity between each neuron's input and output weights to characterize how the two interact, and shows that neuron function depends on the layer.
Findings
Enrichment neurons dominate in early-middle layers.
Later layers have more depletion neurons.
Enrichment neurons are argued to be largely responsible for enriching concept representations, one of the first steps of factual recall.
Abstract
Interpretability researchers have attempted to understand MLP neurons of language models based on both the contexts in which they activate and their output weight vectors. They have paid little attention to a complementary aspect: the interactions between input and output. For example, when neurons detect a direction in the input, they might add much the same direction to the residual stream ("enrichment neurons") or reduce its presence ("depletion neurons"). We address this aspect by examining the cosine similarity between input and output weights of a neuron. We apply our method to 12 models and find that enrichment neurons dominate in early-middle layers whereas later layers tend more towards depletion. To explain this finding, we argue that enrichment neurons are largely responsible for enriching concept representations, one of the first steps of factual recall. Our input-output…
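The core measurement described in the abstract, the cosine similarity between a neuron's input and output weight vectors, can be sketched as follows. This is a minimal illustration, not the authors' implementation. The function names `io_cosine` and `label_neurons`, the single input weight matrix (a gated neuron has more than one input weight vector, and which one to compare is left open here), the `(n_neurons, d_model)` layout, and the `threshold` of 0.5 are all assumptions made for the example.

```python
import numpy as np

def io_cosine(w_in, w_out, eps=1e-12):
    """Per-neuron cosine similarity between input and output weights.

    w_in, w_out: arrays of shape (n_neurons, d_model); row i holds the
    input (resp. output) weight vector of neuron i. Layout is assumed.
    """
    w_in = np.asarray(w_in, dtype=float)
    w_out = np.asarray(w_out, dtype=float)
    dots = np.sum(w_in * w_out, axis=1)
    norms = np.linalg.norm(w_in, axis=1) * np.linalg.norm(w_out, axis=1)
    return dots / np.maximum(norms, eps)  # eps guards against zero vectors

def label_neurons(cos, threshold=0.5):
    """Label neurons by the sign and size of their input-output cosine.

    The threshold is a hypothetical cutoff chosen for illustration only.
    """
    labels = np.full(cos.shape, "other", dtype=object)
    labels[cos >= threshold] = "enrichment"   # writes back what it reads
    labels[cos <= -threshold] = "depletion"   # removes what it reads
    return labels

# Toy example: three neurons in a 2-dimensional residual stream.
w_in = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
w_out = np.array([[2.0, 0.0], [-3.0, 0.0], [0.0, 1.0]])
cos = io_cosine(w_in, w_out)
labels = label_neurons(cos)
print(cos, labels)
```

Computing the share of each label per layer would then give a rough layer-wise profile of the kind the abstract summarizes, under the same assumptions.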
