Faithful and Stable Neuron Explanations for Trustworthy Mechanistic Interpretability

Ge Yan; Tuomas Oikarinen; Tsui-Wei (Lily) Weng

arXiv:2512.18092·cs.AI·December 23, 2025

Open Access · 3 Reviews

TL;DR

This paper provides a theoretical foundation for neuron identification in deep networks, ensuring faithfulness and stability of explanations through generalization bounds and ensemble methods, advancing trustworthy interpretability.

Contribution

It introduces the first theoretical analysis of neuron explanation faithfulness and stability, with guarantees derived from generalization bounds and a bootstrap ensemble approach.

Findings

01

Theoretical guarantees for neuron explanation faithfulness using similarity metrics.

02

A bootstrap ensemble method quantifies stability with coverage guarantees.

03

Experimental validation on synthetic and real data supports the theoretical results.
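The bootstrap ensemble idea in the second finding can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: it assumes binarized neuron activations, binary concept labels, IoU as the similarity metric, and a simple vote-frequency heuristic for building a concept set. All function names (`iou`, `identify`, `bootstrap_concept_votes`, `prediction_set`) and the toy data are hypothetical, and the sketch makes no claim to reproduce the paper's coverage guarantee.

```python
import numpy as np


def iou(neuron_on, concept_on):
    """Intersection over union of two boolean vectors."""
    inter = np.logical_and(neuron_on, concept_on).sum()
    union = np.logical_or(neuron_on, concept_on).sum()
    return inter / union if union > 0 else 0.0


def identify(neuron_on, concepts):
    """Return the index of the concept column with the highest IoU."""
    scores = [iou(neuron_on, concepts[:, j]) for j in range(concepts.shape[1])]
    return int(np.argmax(scores))


def bootstrap_concept_votes(neuron_on, concepts, n_boot=200, seed=0):
    """Resample the probing set with replacement and tally the winning concept."""
    rng = np.random.default_rng(seed)
    n = neuron_on.shape[0]
    votes = np.zeros(concepts.shape[1], dtype=int)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # one bootstrap resample of sample indices
        votes[identify(neuron_on[idx], concepts[idx])] += 1
    return votes / n_boot


def prediction_set(freqs, level=0.9):
    """Heuristic (an assumption of this sketch): smallest set of concepts
    whose vote frequencies sum to at least `level`."""
    order = np.argsort(freqs)[::-1]
    cum = np.cumsum(freqs[order])
    k = min(int(np.searchsorted(cum, level)) + 1, len(order))
    return order[:k].tolist()


# Toy probing set (synthetic, for illustration only): 500 samples, 3 concepts.
# The neuron fires on concept 1, with about 5% of its activations flipped.
rng = np.random.default_rng(1)
concepts = rng.random((500, 3)) < 0.3
noise = rng.random(500) < 0.05
neuron_on = np.logical_xor(concepts[:, 1], noise)

freqs = bootstrap_concept_votes(neuron_on, concepts)
stable_set = prediction_set(freqs, level=0.9)
```

A vote distribution concentrated on one concept suggests the identification is stable under resampling of the probing set; a spread-out distribution suggests that the reported concept depends heavily on which probing samples happened to be drawn.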

Abstract

Neuron identification is a popular tool in mechanistic interpretability, aiming to uncover the human-interpretable concepts represented by individual neurons in deep networks. While algorithms such as Network Dissection and CLIP-Dissect achieve great empirical success, a rigorous theoretical foundation, which is crucial for trustworthy and reliable explanations, remains absent. In this work, we observe that neuron identification can be viewed as the inverse process of machine learning, which allows us to derive guarantees for neuron explanations. Based on this insight, we present the first theoretical analysis of two fundamental challenges: (1) Faithfulness: whether the identified concept faithfully represents the neuron's underlying function and (2) Stability: whether the identification results are consistent across probing datasets. We derive generalization bounds for widely used…
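To illustrate why the "inverse of machine learning" view makes generalization theory applicable, here is a textbook Hoeffding-plus-union bound. It is not taken from the paper and is not necessarily the form of its theorems. The setup is an assumption of this sketch: a binarized neuron $f$, a finite concept set $\mathcal{C}$, accuracy as the similarity metric, and $n$ probing samples $x_1, \dots, x_n$ drawn i.i.d.

```latex
% Empirical and population similarity (accuracy) between concept c and neuron f
\hat{s}_n(c) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\left[ c(x_i) = f(x_i) \right],
\qquad
s(c) = \Pr_{x}\left[ c(x) = f(x) \right].

% Hoeffding's inequality for one fixed concept c
\Pr\left( \left| \hat{s}_n(c) - s(c) \right| \ge \varepsilon \right)
\le 2 \exp\left( -2 n \varepsilon^2 \right).

% Union bound over the finite concept set: with probability at least 1 - \delta,
\max_{c \in \mathcal{C}} \left| \hat{s}_n(c) - s(c) \right|
\le \sqrt{ \frac{ \ln\left( 2 |\mathcal{C}| / \delta \right) }{ 2n } }.
```

Under these assumptions, the concept that scores highest on the probing set is within twice this gap of the best achievable population similarity, in the same way that an empirical risk minimizer is close to the best hypothesis in a finite class.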

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- This work presents a novel and insightful perspective by interpreting the neuron identification problem as an inverse process of machine learning. This conceptual shift provides a fresh way to reason about how neurons emerge and can be systematically explained within learned models.
- This work strengthens the theoretical foundation of neuron interpretation by addressing both faithfulness and stability. It provides analytical support for the reliability of neuron-level explanations and offers

Weaknesses

- This work concludes that CLIP-Dissect tends to identify more abstract concepts, whereas NetDissect captures more concrete ones. However, it remains unclear whether the probing set and concept set used for this comparison are consistent across both methods. Specifically, CLIP-Dissect employs images from the model’s training data as the probing set and utilizes a concept set of approximately 2K words, while NetDissect uses the Broden dataset, which includes segmentation mask annotations but cont

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The paper provides a theoretical analysis of neuron explanations, establishing formal bounds for various concept–neuron similarity metrics to ensure faithfulness, and deriving probabilistic guarantees on concept predictions for stability.
2. The idea of treating neuron identification as an inverse machine learning process is a novel insight that enables the adaptation of generalization theory and helps justify the reliability of neuron explanations.

Weaknesses

1. Although the paper presents explicit theoretical analyses, the empirical validation is limited and does not sufficiently demonstrate the effectiveness of the proposed theorems. For faithfulness, only simple binary cases and synthetic simulations are provided; for stability, the paper includes only two visualization examples, and the additional results in the Appendix appear the same as those in the main paper.
2. The paper would benefit from a more comprehensive empirical verification, includ

Reviewer 03 · Rating 4 · Confidence 2

Strengths

The authors identify a genuine gap (neuron identification lacked a defensible definition of faithfulness) and provide a principled formulation together with a stability construct. Treating the task as an inverse problem is a fresh angle that connects interpretability with classical generalization theory. Moreover, it reframes neuron identification as an inverse problem with explicit statistical guarantees, which is novel for this literature. The theory is sound and broadly applicable: uniform convergence

Weaknesses

Empirical breadth and depth are insufficient.
- Experiments are concentrated on a single architecture and a narrow set of baselines; this limits external validity. Including diverse backbones (e.g., ViT/ConvNeXt, larger CNNs) and multiple concept sources would better test generality.
- The paper promises additional results in Appendix C, but I could not find any additional results in Figure 6 of Appendix C.
- The authors should include qualitative results on how faithfulness difference affects

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis