A Causal Lens for Evaluating Faithfulness Metrics
Kerem Zaman, Shashank Srivastava

TL;DR
This paper introduces Causal Diagnosticity, a framework for evaluating faithfulness metrics of natural language explanations in LLMs, revealing variability in metric performance across tasks and models.
Contribution
It proposes a principled benchmark using model editing to generate explanation pairs, enabling systematic comparison of faithfulness metrics.
Findings
The Filler Tokens metric performs best overall
Continuous metrics are more diagnostic than binary ones
Performance of metrics varies across tasks and models
Abstract
Large Language Models (LLMs) offer natural language explanations as an alternative to feature attribution methods for model interpretability. However, despite their plausibility, they may not reflect the model's true reasoning faithfully. While several faithfulness metrics have been proposed, they are often evaluated in isolation, making principled comparisons between them difficult. We present Causal Diagnosticity, a testbed framework for evaluating faithfulness metrics for natural language explanations. We use the concept of diagnosticity, and employ model-editing methods to generate faithful-unfaithful explanation pairs. Our benchmark includes four tasks: fact-checking, analogy, object counting, and multi-hop reasoning. We evaluate prominent faithfulness metrics, including post-hoc explanation and chain-of-thought methods. Diagnostic performance varies across tasks and models, with…
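As a rough illustration of the diagnosticity idea described in the abstract, the sketch below scores a faithfulness metric by how often it ranks the faithful explanation above the unfaithful one in each pair. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `diagnosticity`, the half-credit rule for ties, and the toy pairs and scores are all hypothetical, and the model-editing step that would produce real pairs is not shown.

```python
from typing import Callable, Sequence, Tuple


def diagnosticity(
    pairs: Sequence[Tuple[str, str]],
    metric: Callable[[str], float],
    tie_credit: float = 0.5,
) -> float:
    """Fraction of (faithful, unfaithful) pairs where `metric` prefers the faithful one.

    Assumption: a higher metric score means "more faithful", and ties earn
    `tie_credit` (0.5 here is an illustrative choice, not taken from the paper).
    """
    if not pairs:
        raise ValueError("at least one explanation pair is required")
    total = 0.0
    for faithful, unfaithful in pairs:
        score_faithful = metric(faithful)
        score_unfaithful = metric(unfaithful)
        if score_faithful > score_unfaithful:
            total += 1.0
        elif score_faithful == score_unfaithful:
            total += tie_credit
    return total / len(pairs)


# Hypothetical toy data: explanation IDs stand in for real explanation text,
# and the lookup table stands in for a real faithfulness metric.
toy_pairs = [("f1", "u1"), ("f2", "u2"), ("f3", "u3"), ("f4", "u4")]
toy_scores = {
    "f1": 0.9, "u1": 0.2,  # metric prefers the faithful explanation
    "f2": 0.4, "u2": 0.7,  # metric prefers the unfaithful explanation
    "f3": 0.5, "u3": 0.5,  # tie
    "f4": 0.8, "u4": 0.1,  # metric prefers the faithful explanation
}

if __name__ == "__main__":
    print(diagnosticity(toy_pairs, toy_scores.__getitem__))
```

Under these assumptions, a metric that scores pairs at random would land near 0.5, and a metric that always prefers the faithful explanation would reach 1.0. A binary metric can only win or tie on each pair, which is one plausible reason it may discriminate less than a continuous one.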
Peer Reviews
Decision · Submitted to ICLR 2025
1. Originality: The paper introduces a novel approach that uses causal model editing to generate faithful-unfaithful explanation pairs, offering a rigorous basis for assessing faithfulness in natural language explanations. This approach combines causality with faithfulness evaluation and tries to get to the model’s true reasoning processes. 2. Quality: The paper is rigorous, with comprehensive experiments across three tasks and multiple language models. The inclusion of alternative model editin
1. The use of synthetic explanations may be limiting, as these pairs might not fully represent actual model-generated explanations. It would be helpful if the authors provided an analysis of how well synthetic explanations align with actual ones. 2. The focus on three specific tasks (fact-checking, analogy, object counting) may not generalize well to more complex contexts. Adding diverse tasks or discussing broader applicability would be helpful. Have the authors considered experimenting with ot
1. The paper is well-written - the motivation of the work is clearly presented, related works are well discussed, the proposed approach and experiments are clearly described, and results are well discussed. 2. The topic the paper focuses on is extremely important. Given the widespread usage of LLMs, it is very important to develop faithful methods to explain their predictions, but it is equally important to benchmark them. 3. The experiments are diverse and include ablation studies to understand if
1. The paper seems very applied to me with limited novelty. The authors extend an existing metric (called diagnosticity) to natural language explanations by arguing that random text cannot work as a meaningful explanation (line 188..). However, this argument needs more backing/examples, as random text can be considered an unfaithful explanation, as done previously by Chan et al. 2022b. 2. Secondly, the authors introduce model editing as a way to generate pairs of explanations (faithful and unfaithfu
The framework on evaluating faithfulness metrics for natural language explanations is quite novel. The use of model editing to create the three synthetic tasks is also very novel. Extensive evaluations of several different faithfulness metrics are used.
My biggest concern is with the generation of synthetic explanations, and the assumption that one is correct and the other is incorrect. In particular, while the model is edited on the particular fact, it is unclear whether the edit causes the model to use the "intended" reasoning path or whether the model is actually using a very different reasoning path. For example, in the Rihanna example, it could be that the model editing removes the "Rihanna" entity from the "singer set", and hence resu
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education · Topic Modeling
