Pitfalls in Evaluating Interpretability Agents

Tal Haklay; Nikhil Prakash; Sana Pandey; Antonio Torralba; Aaron Mueller; Jacob Andreas; Tamar Rott Shaham; Yonatan Belinkov

arXiv:2603.20101 · cs.AI · March 23, 2026

TL;DR

This paper examines the challenges and pitfalls of evaluating automated interpretability systems, especially those using large language models, highlighting issues with current methods and proposing an intrinsic evaluation approach.

Contribution

The paper identifies key limitations of replication-based evaluation for interpretability agents and introduces an unsupervised intrinsic evaluation method based on component interchangeability.

Findings

01. Replication-based evaluation can be subjective and incomplete.

02. Outcome-based comparisons may obscure the research process.

03. LLM-based systems might reproduce findings via memorization.

Abstract

Automated interpretability systems aim to reduce the need for human labor and scale analysis to increasingly large models and diverse tasks. Recent efforts toward this goal leverage large language models (LLMs) at increasing levels of autonomy, ranging from fixed one-shot workflows to fully autonomous interpretability agents. This shift creates a corresponding need to scale evaluation approaches to keep pace with both the volume and complexity of generated explanations. We investigate this challenge in the context of automated circuit analysis -- explaining the roles of model components when performing specific tasks. To this end, we build an agentic system in which a research agent iteratively designs experiments and refines hypotheses. When evaluated against human expert explanations across six circuit analysis tasks in the literature, the system appears competitive. However, closer…

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Topic Modeling