Evaluating Neuron Explanations: A Unified Framework with Sanity Checks

Tuomas Oikarinen; Ge Yan; Tsui-Wei Weng

arXiv:2506.05774 · cs.LG · June 9, 2025

TL;DR

This paper introduces a unified framework for evaluating neuron explanations in neural networks, highlighting the reliability issues of current metrics and proposing guidelines for more trustworthy evaluation methods.

Contribution

The paper unifies existing explanation evaluation methods into a single mathematical framework and proposes two sanity checks for identifying reliable evaluation metrics.

Findings

01. Many commonly used metrics fail the proposed sanity checks.

02. A reliable metric should change its score when the concept labels are substantially modified.

03. The authors propose guidelines for future evaluation practice.

Abstract

Understanding the function of individual units in a neural network is an important building block for mechanistic interpretability. This is often done by generating a simple text explanation of the behavior of individual neurons or units. For these explanations to be useful, we must understand how reliable and truthful they are. In this work we unify many existing explanation evaluation methods under one mathematical framework. This allows us to compare existing evaluation metrics, understand the evaluation pipeline with increased clarity and apply existing statistical methods on the evaluation. In addition, we propose two simple sanity checks on the evaluation metrics and show that many commonly used metrics fail these tests and do not change their score after massive changes to the concept labels. Based on our experimental and theoretical results, we propose guidelines that future…
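To make the idea of a label-perturbation sanity check concrete, here is a minimal sketch. It is not the paper's procedure: the toy data, the two metrics, and the helper names (`frac_active_with_concept`, `iou`, `score_changes`) are all assumptions chosen for illustration. The sketch adds many extra positive concept labels and asks whether each metric's score moves at all.

```python
# Hypothetical illustration of a label-perturbation sanity check.
# All names and data here are assumptions, not taken from the paper.

def frac_active_with_concept(active, concept):
    """Fraction of inputs that activate the neuron and also carry the concept label."""
    hits = sum(1 for a, c in zip(active, concept) if a and c)
    n_active = sum(active)
    return hits / n_active if n_active else 0.0


def iou(active, concept):
    """Intersection over union of the 'neuron active' set and the 'concept present' set."""
    inter = sum(1 for a, c in zip(active, concept) if a and c)
    union = sum(1 for a, c in zip(active, concept) if a or c)
    return inter / union if union else 0.0


def score_changes(metric, active, concept, perturbed, tol=1e-9):
    """Return True if the metric's score differs between original and perturbed labels."""
    return abs(metric(active, concept) - metric(active, perturbed)) > tol


# Toy data: 10 inputs, binarized neuron activations and concept labels.
active = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
concept = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]    # labels match the neuron exactly
perturbed = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]  # five extra positive labels added

results = {
    m.__name__: score_changes(m, active, concept, perturbed)
    for m in (frac_active_with_concept, iou)
}
print(results)
```

On this toy data the first metric stays at 1.0 even though the perturbed concept now covers most of the inputs, while IoU falls from 1.0 to 3/8. A metric of the first kind cannot tell an accurate explanation from an overly generic one, which is the sort of failure a sanity check on concept labels is meant to expose.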

Videos

Evaluating Neuron Explanations: A Unified Framework with Sanity Checks · SlidesLive

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Cell Image Analysis Techniques