On the Pitfalls of Analyzing Individual Neurons in Language Models
Omer Antverg, Yonatan Belinkov

TL;DR
This paper critically examines probe-based methods for analyzing individual neurons in language models. It highlights two methodological pitfalls and distinguishes information that is encoded in a representation from information that the model actually uses.
Contribution
It identifies two pitfalls in the common probe-based approach to neuron analysis and introduces a simple ranking method, which is compared against two recent ranking methods.
Findings
Separates probe quality from ranking quality in neuron analysis
Shows that encoded information differs from information used by the model
Evaluates ranking methods with respect to both factors
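
The gap between encoded and used information can be illustrated with a toy example. The sketch below is not taken from the paper: the synthetic activations, the `toy_model_output` head, the single-neuron threshold probe, and the mean-ablation intervention are all assumptions made here for illustration. Two neurons carry the label, but the hypothetical downstream head reads only one of them, so a probe finds the attribute in both while an intervention changes the output for only one.

```python
import numpy as np

def make_hidden(n=1000, seed=0):
    """Synthetic hidden states (assumption): neurons 0 and 1 both encode the label, neuron 2 is noise."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    h = rng.normal(scale=0.1, size=(n, 3))
    h[:, 0] += y
    h[:, 1] += y
    return h, y

def toy_model_output(h):
    """Hypothetical downstream head that reads only neuron 0."""
    return (h[:, 0] > 0.5).astype(int)

def threshold_probe_accuracy(h, y, j):
    """Single-neuron probe: threshold at the midpoint between the two class means."""
    m0, m1 = h[y == 0, j].mean(), h[y == 1, j].mean()
    pred = (h[:, j] > (m0 + m1) / 2).astype(int)
    if m1 < m0:  # orient the threshold so the higher-mean class is predicted above it
        pred = 1 - pred
    return float((pred == y).mean())

def ablation_effect(h, j):
    """Fraction of outputs that change when neuron j is replaced by its mean value."""
    h_ablated = h.copy()
    h_ablated[:, j] = h[:, j].mean()
    return float((toy_model_output(h) != toy_model_output(h_ablated)).mean())

h, y = make_hidden()
for j in range(3):
    print(j, threshold_probe_accuracy(h, y, j), ablation_effect(h, j))
```

In this toy setting, neuron 1 is as easy to probe as neuron 0, yet ablating it leaves every output unchanged, because the head never reads it. A ranking based only on probing would treat the two neurons as equally important.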
Abstract
While many studies have shown that linguistic information is encoded in hidden word representations, few have studied individual neurons to show how, and in which neurons, it is encoded. Among these, the common approach is to use an external probe to rank neurons according to their relevance to some linguistic attribute, and to evaluate the obtained ranking using the same probe that produced it. We show two pitfalls in this methodology: (1) It confounds distinct factors: probe quality and ranking quality. We separate them and draw conclusions on each. (2) It focuses on encoded information, rather than information that is used by the model. We show that these are not the same. We compare two recent ranking methods and a simple one we introduce, and evaluate them with regard to both of these aspects.
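
One way to picture the first pitfall is to cross every ranking with every evaluating classifier, instead of scoring a ranking only with the probe that produced it. The following is a minimal sketch under stated assumptions, not the paper's experimental setup: the data are synthetic, and the two rankings (absolute weight of a least-squares linear probe, and a probe-free gap between class means) and the two evaluators (a linear probe and a nearest-class-mean classifier) were chosen here purely for illustration. All function names are hypothetical.

```python
import numpy as np

def make_data(n=600, d=20, informative=(0, 1, 2), seed=0):
    """Synthetic 'activations' (assumption): only the informative neurons depend on the label."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, d))
    X[:, list(informative)] += 2.0 * y[:, None]
    return X, y

def _with_bias(X):
    return np.hstack([X, np.ones((X.shape[0], 1))])

def fit_linear(X, y):
    """Least-squares linear probe on +/-1 targets; the last weight is the bias."""
    w, *_ = np.linalg.lstsq(_with_bias(X), 2.0 * y - 1.0, rcond=None)
    return w

def rank_by_probe_weight(X, y):
    """Probe-based ranking: neurons sorted by absolute probe weight."""
    return np.argsort(-np.abs(fit_linear(X, y)[:-1]))

def rank_by_mean_gap(X, y):
    """Probe-free ranking: neurons sorted by the gap between class means."""
    gap = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    return np.argsort(-gap)

def eval_linear(Xtr, ytr, Xte, yte):
    pred = (_with_bias(Xte) @ fit_linear(Xtr, ytr) > 0).astype(int)
    return float((pred == yte).mean())

def eval_nearest_mean(Xtr, ytr, Xte, yte):
    m0, m1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    d0 = ((Xte - m0) ** 2).sum(axis=1)
    d1 = ((Xte - m1) ** 2).sum(axis=1)
    return float(((d1 < d0).astype(int) == yte).mean())

def ranking_by_probe_grid(k=3, seed=0):
    """Score the top-k neurons of each ranking with each evaluator."""
    X, y = make_data(seed=seed)
    Xtr, ytr, Xte, yte = X[:400], y[:400], X[400:], y[400:]
    rankings = {"probe_weight": rank_by_probe_weight(Xtr, ytr),
                "mean_gap": rank_by_mean_gap(Xtr, ytr)}
    evaluators = {"linear": eval_linear, "nearest_mean": eval_nearest_mean}
    grid = {}
    for r_name, order in rankings.items():
        top = order[:k]
        for e_name, evaluate in evaluators.items():
            grid[(r_name, e_name)] = evaluate(Xtr[:, top], ytr, Xte[:, top], yte)
    return rankings, grid

rankings, grid = ranking_by_probe_grid()
for key, acc in grid.items():
    print(key, acc)
```

Reading the grid along a fixed evaluator compares rankings under the same classifier, and reading it along a fixed ranking compares classifiers on the same neurons. Scoring each ranking only with the probe that produced it would fill in a single diagonal of this grid, which is where the two factors become hard to tell apart.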
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
