Rigorously Assessing Natural Language Explanations of Neurons
Jing Huang, Atticus Geiger, Karel D'Oosterlinck, Zhengxuan Wu, Christopher Potts

TL;DR
This paper introduces two evaluation methods for natural language explanations of neurons in language models, shows that even the most confident explanations have high error rates and little to no causal efficacy, and questions whether natural language is a suitable medium for such explanations.
Contribution
The paper develops observational and intervention evaluation modes for natural language neuron explanations, providing a rigorous framework to assess their faithfulness and causal validity.
Findings
Even the most confident explanations have high error rates.
The explanations show little to no causal efficacy.
The paper questions whether natural language is a suitable medium for explanations.
Abstract
Natural language is an appealing medium for explaining how large language models process and store information, but evaluating the faithfulness of such explanations is challenging. To help address this, we develop two modes of evaluation for natural language explanations that claim individual neurons represent a concept in a text input. In the observational mode, we evaluate claims that a neuron a activates on all and only input strings that refer to a concept picked out by the proposed explanation E. In the intervention mode, we construe E as a claim that the neuron a is a causal mediator of the concept denoted by E. We apply our framework to the GPT-4-generated explanations of GPT-2 XL neurons of Bills et al. (2023) and show that even the most confident explanations have high error rates and little to no causal efficacy. We close the paper by critically assessing whether…
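The "all and only" claim in the observational mode can be read as a pair of error checks: a neuron that fires on strings outside the concept violates "only", and a neuron that stays silent on strings inside the concept violates "all". The sketch below is a minimal illustration of that reading, not the paper's implementation. The `Example` record, the activation threshold, the toy data, and the choice of precision, recall, and F1 as summary scores are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Example:
    """Hypothetical record: one input string with a neuron reading and a gold label."""
    text: str
    activation: float        # assumed: the neuron's activation on this input
    refers_to_concept: bool  # assumed: whether the input refers to the explained concept


def observational_scores(examples: List[Example], threshold: float) -> Dict[str, float]:
    """Score the 'all and only' claim with precision ('only') and recall ('all')."""
    tp = fp = fn = 0
    for ex in examples:
        fires = ex.activation > threshold
        if fires and ex.refers_to_concept:
            tp += 1  # neuron fires on a concept string, as the explanation predicts
        elif fires and not ex.refers_to_concept:
            fp += 1  # violates "only": fires on a non-concept string
        elif not fires and ex.refers_to_concept:
            fn += 1  # violates "all": silent on a concept string
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data, invented purely for illustration.
toy = [
    Example("Paris is lovely in spring", 0.9, True),
    Example("The stock fell sharply", 0.8, False),
    Example("She flew to Tokyo", 0.1, True),
    Example("He ate lunch", 0.2, False),
]
scores = observational_scores(toy, threshold=0.5)
print(scores)
```

An explanation that fully satisfied the claim would score 1.0 on both precision and recall under this reading. The threshold is a free parameter here, and a real evaluation would need a principled way to set it.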
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Machine Learning in Materials Science
Methods: Attention Is All You Need · Residual Connection · Adam · Discriminative Fine-Tuning · Weight Decay · Dropout · Cosine Annealing · Linear Layer · Layer Normalization · Multi-Head Attention
