Towards Faithful Natural Language Explanations: A Study Using Activation Patching in Large Language Models
Wei Jie Yeo, Ranjan Satapathy, Erik Cambria

TL;DR
This paper introduces Causal Faithfulness, a new metric based on activation patching, to better measure the faithfulness of explanations generated by large language models, addressing limitations of previous methods.
Contribution
It proposes a causal mediation technique for faithfulness measurement, demonstrating its effectiveness across various model sizes and tuning states, and highlighting its advantages over existing tests.
Findings
Models with alignment tuning produce more faithful explanations.
Causal Faithfulness outperforms existing faithfulness metrics.
The method accounts for internal model computations and avoids out-of-distribution issues.
Abstract
Large Language Models (LLMs) are capable of generating persuasive Natural Language Explanations (NLEs) to justify their answers. However, the faithfulness of these explanations should not be taken at face value. Recent studies have proposed various methods to measure the faithfulness of NLEs, typically by inserting perturbations at the explanation or feature level. We argue that these approaches are neither comprehensive nor correctly designed according to the established definition of faithfulness. Moreover, we highlight the risks of grounding faithfulness findings on out-of-distribution samples. In this work, we leverage a causal mediation technique called activation patching to measure the faithfulness of an explanation towards supporting the explained answer. Our proposed metric, Causal Faithfulness, quantifies the consistency of causal attributions between explanations…
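As a rough illustration of the activation patching idea named in the abstract, here is a minimal sketch on a toy network. It is not the paper's implementation: the weights `W1` and `W2`, the inputs, and the helper names `hidden`, `output_probs`, and `patching_effects` are all hypothetical. The sketch runs a corrupted input, swaps in one hidden activation at a time from a clean run, and records how much the probability of the target answer recovers.

```python
import math

# Hypothetical toy weights: 3 inputs -> 3 hidden units -> 2 output logits.
W1 = [[1.0, 0.0, 0.0],
      [0.0, 1.0, 0.0],
      [0.0, 0.0, 1.0]]
W2 = [[2.0, 0.5, 0.0],
      [0.0, 0.5, 0.0]]


def hidden(x):
    """Hidden activations: tanh of a linear map of the input."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]


def output_probs(h):
    """Softmax over the output logits computed from hidden activations."""
    logits = [sum(w * hi for w, hi in zip(row, h)) for row in W2]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def patching_effects(clean_x, corrupt_x, target):
    """For each hidden unit, patch its clean activation into the corrupted
    run and return the change in probability of the target output."""
    clean_h = hidden(clean_x)
    corrupt_h = hidden(corrupt_x)
    base = output_probs(corrupt_h)[target]
    effects = []
    for i in range(len(corrupt_h)):
        patched = list(corrupt_h)
        patched[i] = clean_h[i]  # restore a single clean activation
        effects.append(output_probs(patched)[target] - base)
    return effects


clean_x = [1.0, 1.0, 1.0]    # assumed "clean" input
corrupt_x = [0.0, 0.0, 0.0]  # assumed "corrupted" input
effects = patching_effects(clean_x, corrupt_x, target=0)
print(effects)
```

In this toy setup only the first hidden unit feeds the target logit differently from the other logit, so it is the only unit whose patch shifts the target probability. In a real LLM the patched activations would typically be layer or token-position outputs rather than single units.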
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
