Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations
Yanda Chen, Ruiqi Zhong, Narutatsu Ri, Chen Zhao, He He, Jacob Steinhardt, Zhou Yu, Kathleen McKeown

TL;DR
This paper introduces the concept of counterfactual simulatability to evaluate whether natural language explanations from large language models enable humans to accurately infer the models' outputs on diverse hypothetical inputs, revealing limitations in current explanations.
Contribution
The paper proposes new metrics for assessing explanation quality based on counterfactual inference and evaluates state-of-the-art LLMs, uncovering their explanations' low precision and limited usefulness for understanding model behavior.
Findings
LLM explanations have low precision in enabling humans to predict model outputs.
Precision of explanations does not correlate with their perceived plausibility.
Naive optimization of explanations for human approval may be insufficient.
Abstract
Large language models (LLMs) are trained to imitate humans to explain human decisions. However, do LLMs explain themselves? Can they help humans build mental models of how LLMs process different inputs? To answer these questions, we propose to evaluate the counterfactual simulatability of natural language explanations: whether an explanation can enable humans to precisely infer the model's outputs on diverse counterfactuals of the explained input. For example, if a model answers "yes" to the input question "Can eagles fly?" with the explanation "all birds can fly", then humans would infer from the explanation that it would also answer "yes" to the counterfactual input "Can penguins fly?". If the explanation is precise, then the model's answer should match humans' expectations. We implemented two metrics based on counterfactual simulatability: precision and generality. We…
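The precision idea in the abstract can be illustrated with a minimal sketch. It assumes precision is the fraction of counterfactuals on which the human-inferred answer matches the model's actual answer, counted only over counterfactuals where the explanation lets a human infer an answer at all. The names `Counterfactual` and `simulation_precision`, and all example data below, are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Counterfactual:
    """One counterfactual input (hypothetical structure, for illustration)."""
    question: str
    # Answer a human infers from the explanation; None if the explanation
    # says nothing about this input.
    human_inferred: Optional[str]
    # Answer the model actually gives on this input.
    model_output: str


def simulation_precision(items: Sequence[Counterfactual]) -> Optional[float]:
    """Fraction of simulatable counterfactuals where the human guess
    matches the model output. Returns None if none are simulatable."""
    simulatable = [c for c in items if c.human_inferred is not None]
    if not simulatable:
        return None
    matches = sum(c.human_inferred == c.model_output for c in simulatable)
    return matches / len(simulatable)


# Made-up data: explanation "all birds can fly" for "Can eagles fly?".
examples = [
    Counterfactual("Can penguins fly?", "yes", "no"),   # human guess is wrong
    Counterfactual("Can sparrows fly?", "yes", "yes"),  # human guess is right
    Counterfactual("Can dogs fly?", None, "no"),        # explanation is silent
]

precision = simulation_precision(examples)
print(precision)  # 0.5
```

In this toy case, the explanation leads a human to the wrong answer on one of the two counterfactuals it covers, so the sketch reports a precision of 0.5. Generality, the second metric, concerns how diverse the covered counterfactuals are and is not modeled here.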
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI)
Methods: Counterfactual Explanations
