A Causal Lens for Evaluating Faithfulness Metrics

Kerem Zaman; Shashank Srivastava

arXiv:2502.18848·cs.CL·December 29, 2025


TL;DR

This paper introduces Causal Diagnosticity, a framework for evaluating faithfulness metrics for natural language explanations produced by LLMs. It finds that metric performance varies across tasks and models.

Contribution

It proposes a principled benchmark using model editing to generate explanation pairs, enabling systematic comparison of faithfulness metrics.

Findings

1. Filler Tokens metric performs best overall
2. Continuous metrics are more diagnostic than binary ones
3. Performance of metrics varies across tasks and models

Abstract

Large Language Models (LLMs) offer natural language explanations as an alternative to feature attribution methods for model interpretability. However, despite their plausibility, they may not reflect the model's true reasoning faithfully. While several faithfulness metrics have been proposed, they are often evaluated in isolation, making principled comparisons between them difficult. We present Causal Diagnosticity, a testbed framework for evaluating faithfulness metrics for natural language explanations. We use the concept of diagnosticity, and employ model-editing methods to generate faithful-unfaithful explanation pairs. Our benchmark includes four tasks: fact-checking, analogy, object counting, and multi-hop reasoning. We evaluate prominent faithfulness metrics, including post-hoc explanation and chain-of-thought methods. Diagnostic performance varies across tasks and models, with…
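To make the idea of diagnosticity concrete, here is a minimal sketch. It assumes diagnosticity is estimated as the fraction of faithful-unfaithful pairs in which a metric scores the faithful explanation higher, with ties counted as half a win. The function name `diagnosticity`, the tie-handling rule, and the toy scores are illustrative assumptions and are not taken from the paper or its code.

```python
from typing import Sequence, Tuple


def diagnosticity(score_pairs: Sequence[Tuple[float, float]]) -> float:
    """Estimate how often a metric prefers the faithful explanation.

    Each pair holds (score_for_faithful, score_for_unfaithful), where a
    higher score means the metric judges the explanation more faithful.
    Ties count as half a win (an assumption made for this sketch).
    """
    if not score_pairs:
        raise ValueError("at least one score pair is required")
    wins = 0.0
    for faithful_score, unfaithful_score in score_pairs:
        if faithful_score > unfaithful_score:
            wins += 1.0
        elif faithful_score == unfaithful_score:
            wins += 0.5
    return wins / len(score_pairs)


# Made-up scores from a hypothetical continuous metric.
continuous_scores = [(0.9, 0.6), (0.7, 0.8), (0.8, 0.55), (0.4, 0.1)]

# The same scores thresholded at 0.5 to mimic a binary metric.
binary_scores = [(float(f >= 0.5), float(u >= 0.5)) for f, u in continuous_scores]

continuous_result = diagnosticity(continuous_scores)
binary_result = diagnosticity(binary_scores)
```

In this toy data the thresholded scores tie on every pair, so the binary variant cannot separate faithful from unfaithful explanations. That is one plausible reading of why a continuous metric could be more diagnostic, though the paper's own estimator may differ.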

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. Originality: The paper introduces a novel approach that uses causal model editing to generate faithful-unfaithful explanation pairs, offering a rigorous basis for assessing faithfulness in natural language explanations. This approach combines causality with faithfulness evaluation and tries to get to the model’s true reasoning processes.
2. Quality: The paper is rigorous, with comprehensive experiments across three tasks and multiple language models. The inclusion of alternative model editin

Weaknesses

1. The use of synthetic explanations may be limiting, as these pairs might not fully represent actual model-generated explanations. It would be helpful if the authors provided an analysis of how well synthetic explanations align with actual ones.
2. The focus on three specific tasks (fact-checking, analogy, object counting) may not generalize well to more complex contexts. Adding diverse tasks or discussing broader applicability would be helpful. Have the authors considered experimenting with ot

Reviewer 02 · Rating 5 · Confidence 3

Strengths

1. The paper is well written: the motivation of the work is clearly presented, related work is well discussed, the proposed approach and experiments are clearly described, and the results are well discussed.
2. The topic the paper focuses on is extremely important. Given the widespread usage of LLMs, it is very important to develop faithful methods to explain their predictions, but it is equally important to benchmark them.
3. The experiments are diverse and include ablation studies to understand if

Weaknesses

1. The paper seems very applied to me, with limited novelty. The authors extend an existing metric (called diagnosticity) to natural language explanations by arguing that random text cannot work as a meaningful explanation (line 188..). However, this argument needs more backing or examples, as random text can be considered an unfaithful explanation, as done previously by Chan et al. 2022b.
2. Secondly, the authors introduce model editing as a way to generate pairs of explanations (faithful and unfaithfu

Reviewer 03 · Rating 3 · Confidence 4

Strengths

The framework on evaluating faithfulness metrics for natural language explanations is quite novel. The use of model editing to create the three synthetic tasks is also very novel. Extensive evaluations of several different faithfulness metrics are used.

Weaknesses

My biggest concern is with the generation of synthetic explanations, and the assumption that one is correct and the other is incorrect. In particular, while the model is edited on the particular fact, it is unclear whether the edit causes the model to use the "intended" reasoning path or whether the model is actually using a very different one. For example, in the Rihanna example, it could be that the model editing removes the "Rihanna" entity from the "singer set", and hence resu

Code & Models

Datasets

l3-unc/CausalDiagnosticity
dataset · 240 downloads

Videos

A Causal Lens for Evaluating Faithfulness Metrics

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education · Topic Modeling