Pando: Do Interpretability Methods Work When Models Won't Explain Themselves?
Ziqian Zhong, Aashiq Muhamed, Mona T. Diab, Virginia Smith, Aditi Raghunathan

TL;DR
Pando is a benchmark that tests interpretability methods on models with controlled explanations, revealing when white-box tools outperform black-box elicitation and providing insights into model decision processes.
Contribution
Introduces Pando, a benchmark with models trained to produce faithful, no, or unfaithful explanations, to evaluate interpretability methods under different explanation conditions.
Findings
When explanations are faithful, black-box elicitation matches white-box methods.
Gradient-based attribution improves accuracy when explanations are absent or misleading.
Relevance patching (RelP) yields the largest gains among interpretability tools.
Abstract
Mechanistic interpretability is often motivated for alignment auditing, where a model's verbal explanations can be absent, incomplete, or misleading. Yet many evaluations do not control whether black-box prompting alone can recover the target behavior, so apparent gains from white-box tools may reflect elicitation rather than internal signal; we call this the elicitation confounder. We introduce Pando, a model-organism benchmark that breaks this confound via an explanation axis: models are trained to produce either faithful explanations of the true rule, no explanation, or confident but unfaithful explanations of a disjoint distractor rule. Across 720 finetuned models implementing hidden decision-tree rules, agents predict held-out model decisions from labeled query-response pairs, optionally augmented with one interpretability tool output. When explanations are faithful,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
