Towards Faithful Natural Language Explanations: A Study Using Activation Patching in Large Language Models

Wei Jie Yeo; Ranjan Satapathy; Erik Cambria

arXiv:2410.14155 · cs.CL · November 4, 2024

TL;DR

This paper introduces Causal Faithfulness, a new metric based on activation patching, to better measure the faithfulness of explanations generated by large language models, addressing limitations of previous methods.

Contribution

The paper applies a causal mediation technique, activation patching, to faithfulness measurement. It demonstrates the resulting metric across models of various sizes and tuning states, and highlights its advantages over existing faithfulness tests.

Findings

01. Models with alignment tuning produce more faithful explanations.

02. Causal Faithfulness outperforms existing faithfulness metrics.

03. The method accounts for internal model computations and avoids out-of-distribution issues.

Abstract

Large Language Models (LLMs) are capable of generating persuasive Natural Language Explanations (NLEs) to justify their answers. However, the faithfulness of these explanations should not be taken at face value. Recent studies have proposed various methods to measure the faithfulness of NLEs, typically by inserting perturbations at the explanation or feature level. We argue that these approaches are neither comprehensive nor correctly designed according to the established definition of faithfulness. Moreover, we highlight the risks of grounding faithfulness findings on out-of-distribution samples. In this work, we leverage a causal mediation technique called activation patching to measure the faithfulness of an explanation towards supporting the explained answer. Our proposed metric, Causal Faithfulness, quantifies the consistency of causal attributions between explanations…
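To make the idea of activation patching concrete, here is a minimal sketch on a hand-built two-layer network rather than an LLM. It is not the paper's implementation: the weights, inputs, and function names (`forward`, `patching_effects`) are hypothetical, and the corrupted input is chosen by hand. The sketch caches hidden activations from a clean run, then reruns the model on a corrupted input while overwriting one hidden unit at a time with its clean value. The change in output is read as that unit's causal contribution.

```python
# Toy illustration of activation patching on a tiny two-layer network.
# All weights, inputs, and names are hypothetical and chosen for easy arithmetic.

W_HIDDEN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 hidden units, 2 inputs
W_OUT = [2.0, -1.0, 0.5]                          # linear readout


def relu(x):
    return max(0.0, x)


def hidden_activations(x):
    """Compute the hidden-layer activations for input vector x."""
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in W_HIDDEN]


def forward(x, patch=None):
    """Run the model; `patch` maps a hidden index to an activation to overwrite."""
    h = hidden_activations(x)
    if patch:
        for idx, value in patch.items():
            h[idx] = value
    return sum(w * hi for w, hi in zip(W_OUT, h))


def patching_effects(clean_x, corrupted_x):
    """For each hidden unit, restore its clean activation in the corrupted run
    and record how far the output moves from the corrupted baseline."""
    clean_h = hidden_activations(clean_x)
    corrupted_out = forward(corrupted_x)
    effects = []
    for idx, clean_value in enumerate(clean_h):
        patched_out = forward(corrupted_x, patch={idx: clean_value})
        effects.append(patched_out - corrupted_out)
    return effects


clean_x = [1.0, 2.0]
corrupted_x = [0.0, 2.0]  # first input feature knocked out (an assumption)
effects = patching_effects(clean_x, corrupted_x)
print(effects)
```

In this toy setting, the second hidden unit has zero effect because it never reads the corrupted feature, while the other two units carry the whole difference between the clean and corrupted outputs. In an LLM the patched sites would instead be hidden states at particular layers and token positions, and the output would be the probability of the answer or explanation tokens.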

Code & Models

Repositories

wj210/causal-faithfulness
PyTorch · Official

Videos

Towards Faithful Natural Language Explanations: A Study Using Activation Patching in Large Language Models · underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques