Beyond Top Activations: Efficient and Reliable Crowdsourced Evaluation of Automated Interpretability

Tuomas Oikarinen; Ge Yan; Akshay Kulkarni; Tsui-Wei Weng

arXiv:2506.07985·cs.CV·December 4, 2025

TL;DR

This paper introduces two techniques for cost-effective crowdsourced evaluation of automated interpretability methods, reports a roughly 40x reduction in evaluation cost together with more reliable ratings, and demonstrates both in a large-scale study on vision networks.

Contribution

It presents Model-Guided Importance Sampling and Bayesian Rating Aggregation to improve the efficiency and reliability of crowdsourced interpretability evaluations.

Findings

01 Evaluation cost reduced by ~40x
02 Input selection reduces required inputs by ~13x
03 Rating noise mitigated, reducing ratings needed by ~3x
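
To make the idea of aggregating noisy ratings concrete, here is a minimal sketch of a generic Bayesian update over a discrete "true" rating. It is not the paper's Bayesian Rating Aggregation, whose details are not given on this page. The function name `posterior_over_rating`, the 1-5 scale, the uniform prior, and the noise model (a rater reports the true level with probability `p_correct`, otherwise a uniformly random other level) are all assumptions made for illustration.

```python
def posterior_over_rating(ratings, levels=(1, 2, 3, 4, 5), p_correct=0.6):
    """Posterior over a hypothetical 'true' rating given noisy crowd ratings.

    Assumed noise model (illustrative only): each rater reports the true
    level with probability p_correct, and any other level uniformly at random.
    """
    k = len(levels)
    posterior = [1.0 / k] * k  # uniform prior over the rating levels
    for r in ratings:
        # Multiply in the likelihood of this rating under each candidate level.
        posterior = [
            p * (p_correct if r == level else (1.0 - p_correct) / (k - 1))
            for p, level in zip(posterior, levels)
        ]
    total = sum(posterior)
    return [p / total for p in posterior]


# Three raters, two of whom agree on a rating of 4.
post = posterior_over_rating([4, 4, 5])
best_level = max(zip(post, (1, 2, 3, 4, 5)))[1]
```

Compared with a plain average, a posterior of this kind also expresses how uncertain the aggregate is, which is one plausible way that fewer ratings could suffice; whether the paper uses this particular mechanism is not stated here.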

Abstract

Interpreting individual neurons or directions in activation space is an important topic in mechanistic interpretability. Numerous automated interpretability methods have been proposed to generate such explanations, but it remains unclear how reliable these explanations are, and which methods produce the most accurate descriptions. While crowd-sourced evaluations are commonly used, existing pipelines are noisy, costly, and typically assess only the highest-activating inputs, leading to unreliable results. In this paper, we introduce two techniques to enable cost-effective and accurate crowdsourced evaluation of automated interpretability methods beyond top activating inputs. First, we propose Model-Guided Importance Sampling (MG-IS) to select the most informative inputs to show human raters. In our experiments, we show this reduces the number of inputs needed to reach the same evaluation…
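
The abstract is cut off before it describes MG-IS in detail, so the following is only a sketch of textbook importance sampling, not the authors' algorithm. It estimates the dataset-wide mean of a per-input quantity while sampling inputs non-uniformly and reweighting them. The names `importance_sampling_mean`, `values`, and `proposal_weights` are hypothetical; in a setting like the one described, `values` would stand in for something only human raters can supply, and `proposal_weights` for a model's guess at which inputs matter most.

```python
import random


def importance_sampling_mean(values, proposal_weights, n_samples, seed=0):
    """Estimate the uniform mean of `values` from non-uniformly sampled inputs.

    Inputs are drawn with probability proportional to `proposal_weights`
    and reweighted by (target / proposal) so the estimate stays unbiased.
    The proposal must be positive wherever the value is nonzero.
    """
    rng = random.Random(seed)
    n = len(values)
    total = sum(proposal_weights)
    q = [w / total for w in proposal_weights]  # normalized proposal
    sampled = rng.choices(range(n), weights=q, k=n_samples)
    # Target distribution is uniform over inputs, so target prob is 1 / n.
    return sum((1.0 / n) / q[i] * values[i] for i in sampled) / n_samples


values = [0.1, 0.2, 0.7, 4.0]
# If the proposal is exactly proportional to the values, every sample
# yields the same weighted term and the estimator has zero variance.
estimate = importance_sampling_mean(values, values, n_samples=5)
true_mean = sum(values) / len(values)
```

The usage lines illustrate why a good proposal reduces the number of inputs needed: the closer the sampling distribution tracks the quantity being averaged, the lower the variance of the estimate for a fixed number of samples.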

Code & Models

Repositories

trustworthy-ml-lab/efficient_neuron_eval (Official)

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis