On the Interaction of Belief Bias and Explanations
Ana Valeria Gonzalez, Anna Rogers, Anders Søgaard

TL;DR
This paper examines how belief bias influences human evaluation of explainability methods in NLP, showing that controlling for evaluators' prior beliefs can change conclusions about which methods perform best.
Contribution
It highlights the impact of belief bias on human evaluation of explanations and proposes methods to control for it in NLP explainability assessments.
Findings
Belief bias affects human judgments of explanation quality.
Controlling for prior beliefs changes which explainability methods appear to perform best.
Simple controls, namely models of varying quality and adversarial examples, can account for belief bias in human assessments.
Abstract
A myriad of explainability methods have been proposed in recent years, but there is little consensus on how to evaluate them. While automatic metrics allow for quick benchmarking, it isn't clear how such metrics reflect human interaction with explanations. Human evaluation is of paramount importance, but previous protocols fail to account for belief biases affecting human performance, which may lead to misleading conclusions. We provide an overview of belief bias, its role in human evaluation, and ideas for NLP practitioners on how to account for it. For two experimental paradigms, we present a case study of gradient-based explainability introducing simple ways to account for humans' prior beliefs: models of varying quality and adversarial examples. We show that conclusions about the highest performing methods change when introducing such controls, pointing to the importance of…
