Faithfulness and the Notion of Adversarial Sensitivity in NLP Explanations
Supriya Manna, Niladri Sett

TL;DR
This paper introduces Adversarial Sensitivity, a novel method for evaluating faithfulness in NLP explanations by measuring how explanations respond to adversarial attacks, addressing biases in existing techniques.
Contribution
The paper proposes a new faithfulness evaluation approach based on adversarial sensitivity, improving reliability and addressing limitations of current methods.
Findings
Adversarial Sensitivity effectively captures explanation robustness.
The method reveals discrepancies in existing faithfulness metrics.
It provides a more reliable assessment of explanation quality.
Abstract
Faithfulness is arguably the most critical metric to assess the reliability of explainable AI. In NLP, current methods for faithfulness evaluation are fraught with discrepancies and biases, often failing to capture the true reasoning of models. We introduce Adversarial Sensitivity as a novel approach to faithfulness evaluation, focusing on the explainer's response when the model is under adversarial attack. Our method accounts for the faithfulness of explainers by capturing sensitivity to adversarial input changes. This work addresses significant limitations in existing evaluation techniques, and furthermore, quantifies faithfulness from a crucial yet underexplored paradigm.
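The idea in the abstract, comparing what an explainer reports before and after a successful adversarial attack, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the paper's actual setup: the toy bag-of-words model and its weights, the single-word substitution standing in for an attack, the occlusion-based explainer, and the normalized Spearman footrule used as the distance between explanations. The abstract does not specify any of these choices.

```python
from typing import Dict, List

# Hypothetical toy sentiment model: a fixed bag-of-words linear scorer.
# The weights are invented; "superb" is given a slightly negative weight to
# mimic a model quirk that a synonym-substitution attack could exploit.
WEIGHTS: Dict[str, float] = {"great": 2.0, "superb": -0.3, "fine": 0.2}
BIAS = -0.5


def predict_score(tokens: List[str]) -> float:
    """Raw score; positive means class 1, otherwise class 0."""
    return BIAS + sum(WEIGHTS.get(t, 0.0) for t in tokens)


def predict_label(tokens: List[str]) -> int:
    return 1 if predict_score(tokens) > 0 else 0


def occlusion_attributions(tokens: List[str]) -> List[float]:
    """Illustrative explainer: how much each token supports the predicted class.

    Each token is deleted in turn and the change in score is recorded,
    sign-adjusted so that larger values mean stronger support for the
    label the model actually predicts.
    """
    base = predict_score(tokens)
    sign = 1.0 if predict_label(tokens) == 1 else -1.0
    return [
        sign * (base - predict_score(tokens[:i] + tokens[i + 1:]))
        for i in range(len(tokens))
    ]


def ranks(values: List[float]) -> List[int]:
    """Rank positions by descending value (0 = most supportive); ties keep input order."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def rank_distance(a: List[float], b: List[float]) -> float:
    """Normalized Spearman footrule between two attribution vectors, in [0, 1]."""
    n = len(a)
    total = sum(abs(x - y) for x, y in zip(ranks(a), ranks(b)))
    max_total = (n * n) // 2  # largest possible footrule distance for n items
    return total / max_total if max_total else 0.0


def adversarial_sensitivity(original: List[str], adversarial: List[str]) -> float:
    """Distance between explanations of an input and its adversarial counterpart.

    Only meaningful when the attack succeeded, i.e. the predicted label flipped.
    """
    if predict_label(original) == predict_label(adversarial):
        raise ValueError("attack did not change the prediction")
    return rank_distance(
        occlusion_attributions(original),
        occlusion_attributions(adversarial),
    )


original = ["great", "acting", "fine", "plot"]
adversarial = ["superb", "acting", "fine", "plot"]  # one-word substitution
sensitivity = adversarial_sensitivity(original, adversarial)
print(f"sensitivity = {sensitivity:.2f}")
```

Under this reading, an explainer whose ranking barely moves when the model's decision flips would score close to 0, which would suggest it is not tracking the model's changed reasoning. Comparing explainers would mean averaging such a score over many successful adversarial examples.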
Taxonomy
Topics: Epistemology, Ethics, and Metaphysics · Psychology of Moral and Emotional Judgment
