Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations

Yanda Chen; Ruiqi Zhong; Narutatsu Ri; Chen Zhao; He He; Jacob Steinhardt; Zhou Yu; Kathleen McKeown

arXiv:2307.08678·cs.CL·July 18, 2023·5 cites

TL;DR

This paper introduces the concept of counterfactual simulatability to evaluate whether natural language explanations from large language models enable humans to accurately infer the models' outputs on diverse hypothetical inputs, revealing limitations in current explanations.

Contribution

The paper proposes new metrics for assessing explanation quality based on counterfactual inference and evaluates state-of-the-art LLMs, uncovering their explanations' low precision and limited usefulness for understanding model behavior.

Findings

01. LLM explanations have low precision in enabling humans to predict model outputs.

02. Precision of explanations does not correlate with their perceived plausibility.

03. Naive optimization of explanations for human approval may be insufficient.

Abstract

Large language models (LLMs) are trained to imitate humans to explain human decisions. However, do LLMs explain themselves? Can they help humans build mental models of how LLMs process different inputs? To answer these questions, we propose to evaluate counterfactual simulatability of natural language explanations: whether an explanation can enable humans to precisely infer the model's outputs on diverse counterfactuals of the explained input. For example, if a model answers "yes" to the input question "Can eagles fly?" with the explanation "all birds can fly", then humans would infer from the explanation that it would also answer "yes" to the counterfactual input "Can penguins fly?". If the explanation is precise, then the model's answer should match humans' expectations. We implemented two metrics based on counterfactual simulatability: precision and generality. We…
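The precision idea in the abstract can be illustrated with a minimal sketch. It assumes precision is computed as the fraction of counterfactuals on which the answer a human infers from the explanation matches the model's actual answer, counting only counterfactuals where the explanation lets the human infer anything at all. The function name `simulation_precision`, the use of `None` for "cannot infer", and the example answers are all hypothetical illustrations, not the paper's implementation or data.

```python
from typing import Optional, Sequence


def simulation_precision(
    human_guesses: Sequence[Optional[str]],
    model_outputs: Sequence[str],
) -> float:
    """Fraction of simulatable counterfactuals where the human's inferred
    answer equals the model's actual answer.

    A guess of None means the explanation did not let the human infer an
    answer for that counterfactual; such items are skipped (an assumption
    made for this sketch).
    """
    if len(human_guesses) != len(model_outputs):
        raise ValueError("guesses and outputs must have the same length")
    # Keep only counterfactuals for which the human could infer an answer.
    pairs = [(g, m) for g, m in zip(human_guesses, model_outputs) if g is not None]
    if not pairs:
        raise ValueError("no simulatable counterfactuals to score")
    return sum(g == m for g, m in pairs) / len(pairs)


# Hypothetical data for the explanation "all birds can fly".
counterfactuals = ["Can penguins fly?", "Can sparrows fly?", "Can cats fly?"]
human_guesses = ["yes", "yes", None]   # the explanation says nothing about cats
model_outputs = ["no", "yes", "no"]    # made-up model answers for illustration

precision = simulation_precision(human_guesses, model_outputs)
print(precision)  # 0.5: the human is right on sparrows, wrong on penguins
```

In this toy case the explanation is imprecise: it leads the human to expect "yes" for penguins, but the (hypothetical) model answers "no".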

Videos

Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations · SlidesLive

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI)

Methods: Counterfactual Explanations