Faithfulness and the Notion of Adversarial Sensitivity in NLP Explanations

Supriya Manna; Niladri Sett

arXiv:2409.17774·cs.CL·December 2, 2024

Open Access

TL;DR

This paper introduces Adversarial Sensitivity, a method for evaluating the faithfulness of NLP explanations by measuring how an explainer responds when the model is under adversarial attack. It is intended to address the discrepancies and biases of existing evaluation techniques.

Contribution

The paper proposes a faithfulness evaluation approach based on adversarial sensitivity: an explainer is assessed by how its explanation responds when the model is under adversarial attack, which targets limitations of current evaluation methods.

Findings

01. Adversarial Sensitivity effectively captures explanation robustness.

02. The method reveals discrepancies in existing faithfulness metrics.

03. It provides a more reliable assessment of explanation quality.

Abstract

Faithfulness is arguably the most critical metric to assess the reliability of explainable AI. In NLP, current methods for faithfulness evaluation are fraught with discrepancies and biases, often failing to capture the true reasoning of models. We introduce Adversarial Sensitivity as a novel approach to faithfulness evaluation, focusing on the explainer's response when the model is under adversarial attack. Our method accounts for the faithfulness of explainers by capturing sensitivity to adversarial input changes. This work addresses significant limitations in existing evaluation techniques, and furthermore, quantifies faithfulness from a crucial yet underexplored paradigm.
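The abstract does not say how the explainer's response is measured, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes token-level attribution scores are available for an input and for an adversarial counterpart of the same length (for example, after a word substitution), and it uses a Kendall-tau-based rank distance as an illustrative choice. The names `kendall_tau` and `explanation_shift` are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(a: Sequence[float], b: Sequence[float]) -> float:
    """Kendall rank correlation (tau-a) between two equal-length score lists."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs


def explanation_shift(orig_scores: Sequence[float],
                      adv_scores: Sequence[float]) -> float:
    """Illustrative (assumed) sensitivity score in [0, 1].

    0 means the token ranking is unchanged by the adversarial edit;
    1 means the ranking is fully reversed.
    """
    return (1.0 - kendall_tau(orig_scores, adv_scores)) / 2.0


if __name__ == "__main__":
    # Made-up attribution scores, for illustration only.
    original = [0.10, 0.70, 0.20]
    adversarial = [0.15, 0.25, 0.60]
    print(explanation_shift(original, adversarial))
```

Under this reading, an explainer whose ranking barely moves when an attack flips the model's prediction would score near 0, which is the kind of insensitivity such an evaluation would flag. The choice of distance and the same-length assumption are simplifications made for the sketch.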

Taxonomy

Topics: Epistemology, Ethics, and Metaphysics · Psychology of Moral and Emotional Judgment