Rigorously Assessing Natural Language Explanations of Neurons

Jing Huang; Atticus Geiger; Karel D'Oosterlinck; Zhengxuan Wu; Christopher Potts

arXiv:2309.10312 · cs.CL · September 20, 2023

TL;DR

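This paper introduces two modes of evaluation, observational and intervention, for natural language explanations of neurons in language models. Applied to GPT-4-generated explanations of GPT-2 XL neurons, they show that even the most confident explanations have high error rates and little to no causal efficacy. The paper also questions whether natural language is a suitable medium for such explanations.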

Contribution

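The paper develops two modes of evaluation for natural language explanations of neurons. The observational mode tests whether a neuron activates on all and only the inputs that refer to the explained concept. The intervention mode tests whether the neuron is a causal mediator of that concept. Together they give a framework for assessing the faithfulness of such explanations.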

Findings

01

Even the most confident GPT-4-generated explanations of GPT-2 XL neurons have high error rates.

02

The evaluated explanations show little to no causal efficacy in the intervention mode.

03

The paper questions whether natural language is a suitable medium for neuron explanations.

Abstract

Natural language is an appealing medium for explaining how large language models process and store information, but evaluating the faithfulness of such explanations is challenging. To help address this, we develop two modes of evaluation for natural language explanations that claim individual neurons represent a concept in a text input. In the observational mode, we evaluate claims that a neuron $a$ activates on all and only input strings that refer to a concept picked out by the proposed explanation $E$. In the intervention mode, we construe $E$ as a claim that the neuron $a$ is a causal mediator of the concept denoted by $E$. We apply our framework to the GPT-4-generated explanations of GPT-2 XL neurons of Bills et al. (2023) and show that even the most confident explanations have high error rates and little to no causal efficacy. We close the paper by critically assessing whether…
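
The "all and only" claim in the observational mode can be read as two separate checks: the neuron should fire on every string that refers to the concept ("all"), and on no string that does not ("only"). The sketch below is one possible way to score that reading and is not the paper's implementation. The function name `observational_errors`, the activation threshold, and the toy data are all assumptions made for illustration.

```python
from typing import Dict, Sequence


def observational_errors(
    activations: Sequence[float],
    refers_to_concept: Sequence[bool],
    threshold: float = 0.0,  # assumed cutoff for "the neuron activates"
) -> Dict[str, float]:
    """Score an 'all and only' claim for one neuron on labeled inputs.

    activations[i] is the neuron's activation on input string i, and
    refers_to_concept[i] says whether string i refers to the concept
    named by the explanation E (hypothetical labels).
    """
    if len(activations) != len(refers_to_concept):
        raise ValueError("activations and labels must have the same length")

    fired = [a > threshold for a in activations]
    pairs = list(zip(fired, refers_to_concept))

    true_pos = sum(1 for f, c in pairs if f and c)
    false_pos = sum(1 for f, c in pairs if f and not c)   # violates "only"
    false_neg = sum(1 for f, c in pairs if not f and c)   # violates "all"

    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0

    return {
        "false_positives": false_pos,
        "false_negatives": false_neg,
        "precision": precision,
        "recall": recall,
    }


# Toy data: five input strings, made up for illustration only.
toy_activations = [2.1, 0.0, 1.3, 0.0, 0.7]
toy_labels = [True, False, False, True, True]
toy_result = observational_errors(toy_activations, toy_labels)
```

Under this reading, an explanation is fully supported only when both error counts are zero. A false positive is an input where the neuron fires without the concept being present, and a false negative is a concept-bearing input where the neuron stays silent.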

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Machine Learning in Materials Science

Methods: Attention Is All You Need · Residual Connection · Adam · Discriminative Fine-Tuning · Weight Decay · Dropout · Cosine Annealing · Linear Layer · Layer Normalization · Multi-Head Attention