Pando: Do Interpretability Methods Work When Models Won't Explain Themselves?

Ziqian Zhong; Aashiq Muhamed; Mona T. Diab; Virginia Smith; Aditi Raghunathan

arXiv:2604.11061·cs.LG·April 14, 2026

TL;DR

Pando is a benchmark that tests interpretability methods on models trained to give faithful, absent, or unfaithful explanations, showing when white-box tools outperform black-box elicitation.

Contribution

Introduces Pando, a benchmark of models trained to give faithful explanations, no explanations, or unfaithful explanations, used to evaluate interpretability methods under each explanation condition.

Findings

01. When explanations are faithful, black-box elicitation matches white-box methods.

02. Gradient-based attribution improves accuracy when explanations are absent or misleading.

03. Relevance patching (RelP) yields the largest gains among interpretability tools.

Abstract

Mechanistic interpretability is often motivated for alignment auditing, where a model's verbal explanations can be absent, incomplete, or misleading. Yet many evaluations do not control whether black-box prompting alone can recover the target behavior, so apparent gains from white-box tools may reflect elicitation rather than internal signal; we call this the elicitation confounder. We introduce Pando, a model-organism benchmark that breaks this confound via an explanation axis: models are trained to produce either faithful explanations of the true rule, no explanation, or confident but unfaithful explanations of a disjoint distractor rule. Across 720 finetuned models implementing hidden decision-tree rules, agents predict held-out model decisions from 10 labeled query-response pairs, optionally augmented with one interpretability tool output. When explanations are faithful,…
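The evaluation setup described in the abstract (a hidden decision-tree rule, 10 labeled query-response pairs, accuracy on held-out decisions) can be illustrated with a toy sketch. This is not the authors' code: the three binary features, the tree shape, the helper names (`hidden_rule`, `sample_query`, `accuracy`), and both agents are hypothetical stand-ins, and a plain Python function takes the place of a finetuned model.

```python
import random

# Hypothetical hidden rule: a small decision tree over three binary features.
# Feature names and tree shape are illustrative assumptions, not from the paper.
def hidden_rule(x):
    if x["a"]:
        return x["b"]
    return not x["c"]

def sample_query(rng):
    # Draw each binary feature independently and uniformly.
    return {name: rng.random() < 0.5 for name in ("a", "b", "c")}

def accuracy(agent, queries):
    # Fraction of held-out queries where the agent matches the hidden rule.
    correct = sum(agent(q) == hidden_rule(q) for q in queries)
    return correct / len(queries)

rng = random.Random(0)

# The agent sees 10 labeled query-response pairs, as in the abstract.
labeled = [(q, hidden_rule(q)) for q in (sample_query(rng) for _ in range(10))]
held_out = [sample_query(rng) for _ in range(200)]

# Baseline agent: always predict the majority label among the 10 pairs.
majority_label = 2 * sum(label for _, label in labeled) >= len(labeled)

def majority_agent(q):
    return majority_label

# Agent that has recovered the rule, for example from a faithful
# explanation or an informative tool output (assumed here, not derived).
def informed_agent(q):
    return q["b"] if q["a"] else not q["c"]

print("majority baseline:", accuracy(majority_agent, held_out))
print("informed agent:  ", accuracy(informed_agent, held_out))
```

The gap between the two accuracies is the kind of quantity such a benchmark compares across explanation conditions; in the actual benchmark the informed agent would have to infer the rule from the model's explanation or a tool output rather than being handed it.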
