On Measuring Faithfulness or Self-consistency of Natural Language Explanations
Letitia Parcalabescu, Anette Frank

TL;DR
This paper argues that current faithfulness tests for LLM explanations actually measure self-consistency at the output level rather than faithfulness to the model's inner workings. It introduces a benchmark, the Comparative Consistency Bank, and a fine-grained self-consistency measure, CC-SHAP.
Contribution
It clarifies the distinction between faithfulness and self-consistency, recharacterises existing faithfulness tests as self-consistency tests, and introduces CC-SHAP, a fine-grained measure for analyzing LLM self-consistency.
Findings
Existing tests mainly measure self-consistency at the output level, not faithfulness to the model's inner workings.
The Comparative Consistency Bank compares existing self-consistency tests on a common suite of 11 open LLMs and 5 tasks.
CC-SHAP provides a detailed analysis of input contribution to answers and explanations.
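To make the idea of comparing input contributions concrete, here is a minimal sketch. It assumes per-token contribution scores for the answer and for the explanation are already available (for example from a SHAP-style attribution method). The function names, the normalisation step, and the cosine-based comparison are illustrative assumptions for this summary, not the paper's stated definition of CC-SHAP.

```python
import math


def normalize(contribs):
    """Scale contributions so their absolute values sum to 1 (assumed step)."""
    total = sum(abs(c) for c in contribs)
    if total == 0:
        raise ValueError("all contributions are zero")
    return [c / total for c in contribs]


def contribution_agreement(answer_contribs, explanation_contribs):
    """Cosine similarity between two per-token contribution vectors.

    Hypothetical helper: values near 1 mean the same input tokens drive both
    the answer and the explanation; values near -1 mean they pull in
    opposite directions.
    """
    if len(answer_contribs) != len(explanation_contribs):
        raise ValueError("both vectors must cover the same input tokens")
    a = normalize(answer_contribs)
    e = normalize(explanation_contribs)
    dot = sum(x * y for x, y in zip(a, e))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_e = math.sqrt(sum(y * y for y in e))
    return dot / (norm_a * norm_e)


# Made-up contribution scores for a four-token input (illustration only).
answer_scores = [0.5, 0.1, -0.2, 0.2]
explanation_scores = [0.4, 0.2, -0.1, 0.3]
print(contribution_agreement(answer_scores, explanation_scores))
```

A score like this stays at the level of input-output behaviour, which is in line with the paper's framing of such tests as self-consistency checks rather than probes of the model's inner workings.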
Abstract
Large language models (LLMs) can explain their predictions through post-hoc or Chain-of-Thought (CoT) explanations. But an LLM could make up reasonably sounding explanations that are unfaithful to its underlying reasoning. Recent work has designed tests that aim to judge the faithfulness of post-hoc or CoT explanations. In this work we argue that these faithfulness tests do not measure faithfulness to the models' inner workings -- but rather their self-consistency at output level. Our contributions are three-fold: i) We clarify the status of faithfulness tests in view of model explainability, characterising them as self-consistency tests instead. This assessment we underline by ii) constructing a Comparative Consistency Bank for self-consistency tests that for the first time compares existing tests on a common suite of 11 open LLMs and 5 tasks -- including iii) our new self-consistency…
Taxonomy
Topics: Topic Modeling
