Evaluating Neuron Explanations: A Unified Framework with Sanity Checks
Tuomas Oikarinen, Ge Yan, Tsui-Wei Weng

TL;DR
This paper introduces a unified framework for evaluating neuron explanations in neural networks, highlighting the reliability issues of current metrics and proposing guidelines for more trustworthy evaluation methods.
Contribution
The paper unifies existing neuron explanation evaluation methods under a single mathematical framework and proposes two sanity checks for identifying reliable evaluation metrics.
Findings
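Many commonly used evaluation metrics fail the proposed sanity checks: their scores do not change even after massive changes to the concept labels.
A reliable metric should change its score when the concept labels are substantially modified.
Based on the experimental and theoretical results, the authors propose guidelines for future evaluation practice.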
Abstract
Understanding the function of individual units in a neural network is an important building block for mechanistic interpretability. This is often done by generating a simple text explanation of the behavior of individual neurons or units. For these explanations to be useful, we must understand how reliable and truthful they are. In this work we unify many existing explanation evaluation methods under one mathematical framework. This allows us to compare existing evaluation metrics, understand the evaluation pipeline with increased clarity and apply existing statistical methods on the evaluation. In addition, we propose two simple sanity checks on the evaluation metrics and show that many commonly used metrics fail these tests and do not change their score after massive changes to the concept labels. Based on our experimental and theoretical results, we propose guidelines that future…
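The idea of a sanity check on an evaluation metric can be illustrated with a small sketch. Everything below is an assumption made for illustration, not the paper's actual procedure: the toy data, the binarized activations, the metric definitions (recall, precision, IoU), the two label perturbations (dropping half of the positive concept labels, adding spurious ones), and all function and variable names. The sketch only shows the general pattern: perturb the concept labels heavily and see whether a metric's score moves.

```python
import numpy as np

# Hypothetical metric definitions over boolean vectors:
# act[i]     = True if the neuron fires on input i (binarized activation)
# concept[i] = True if input i is labeled with the explanation's concept

def recall(act, concept):
    """Fraction of concept-positive inputs on which the neuron fires."""
    return (act & concept).sum() / concept.sum()

def precision(act, concept):
    """Fraction of neuron-firing inputs that carry the concept label."""
    return (act & concept).sum() / act.sum()

def iou(act, concept):
    """Intersection over union of firing inputs and concept-positive inputs."""
    return (act & concept).sum() / (act | concept).sum()

METRICS = {"recall": recall, "precision": precision, "iou": iou}

# Toy setup (assumed): 10 inputs, and an explanation that matches the neuron perfectly.
act = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], dtype=bool)
concept = act.copy()

# Perturbation 1 (assumed): drop half of the positive concept labels.
missing = concept.copy()
missing[[0, 1]] = False

# Perturbation 2 (assumed): add as many spurious positive labels as real ones.
extra = concept.copy()
extra[[4, 5, 6, 7]] = True

def score_changes(act, original, perturbed):
    """Absolute change in each metric's score after the labels are perturbed."""
    return {
        name: float(abs(metric(act, original) - metric(act, perturbed)))
        for name, metric in METRICS.items()
    }

changes_missing = score_changes(act, concept, missing)
changes_extra = score_changes(act, concept, extra)

for name in METRICS:
    print(f"{name:9s} missing: {changes_missing[name]:.2f}  extra: {changes_extra[name]:.2f}")
```

On this toy data, recall does not move when half of the positive labels are dropped, precision does not move when spurious labels are added, and IoU changes by 0.5 under both perturbations. A metric whose score stays fixed under such a large label change cannot distinguish the original explanation from a much worse one, which is the kind of failure the abstract describes.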
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Cell Image Analysis Techniques
