Goodhart's Law Applies to NLP's Explanation Benchmarks
Jennifer Hsia, Danish Pruthi, Aarti Singh, Zachary C. Lipton

TL;DR
This paper critically examines NLP explanation benchmarks and shows that their scores can be inflated without changing model predictions on in-distribution inputs, which calls into question their reliability for guiding explainability research.
Contribution
It demonstrates that two existing sets of explanation metrics, the ERASER metrics (comprehensiveness and sufficiency) and the EVAL-X metrics, can be arbitrarily inflated, exposing their limitations and prompting a reassessment of evaluation standards.
Findings
Comprehensiveness and sufficiency scores can be inflated without changing model predictions or explanations on in-distribution test inputs
Current benchmarks are vulnerable to simple manipulations
Results question the reliability of explanation metrics
Abstract
Despite the rising popularity of saliency-based explanations, the research community remains at an impasse, facing doubts concerning their purpose, efficacy, and tendency to contradict each other. Seeking to unite the community's efforts around common goals, several recent works have proposed evaluation metrics. In this paper, we critically examine two sets of metrics: the ERASER metrics (comprehensiveness and sufficiency) and the EVAL-X metrics, focusing our inquiry on natural language processing. First, we show that we can inflate a model's comprehensiveness and sufficiency scores dramatically without altering its predictions or explanations on in-distribution test inputs. Our strategy exploits the tendency for extracted explanations and their complements to be "out-of-support" relative to each other and in-distribution inputs. Next, we demonstrate that the EVAL-X metrics can be…
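To make the two ERASER metrics concrete, here is a minimal Python sketch. Comprehensiveness is the drop in predicted-class probability when the rationale tokens are hidden, and sufficiency is the drop when only the rationale tokens are kept; higher comprehensiveness and lower sufficiency are read as better. Everything else in the block is an assumption made for illustration and is not the authors' construction: the lexicon classifier base_model, the [MASK] placeholder used to hide tokens, the example sentence, and the wrapper gamed_model, which behaves identically on unmasked inputs and differently on inputs that look like a rationale or a complement.

```python
from typing import Callable, List, Sequence

MASK = "[MASK]"  # placeholder for hidden tokens (an assumption of this sketch)
Model = Callable[[Sequence[str]], float]  # returns P(positive | tokens)

POSITIVE = {"great", "fun", "superb"}   # hypothetical toy lexicon
NEGATIVE = {"dull", "awful", "boring"}


def base_model(tokens: Sequence[str]) -> float:
    """Toy lexicon classifier: P(positive); masked tokens carry no signal."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return min(1.0, max(0.0, 0.5 + 0.15 * score))


def keep_only(tokens: Sequence[str], idx: Sequence[int]) -> List[str]:
    """Rationale-only input: every token outside idx is masked."""
    keep = set(idx)
    return [t if i in keep else MASK for i, t in enumerate(tokens)]


def remove(tokens: Sequence[str], idx: Sequence[int]) -> List[str]:
    """Complement input: every token inside idx is masked."""
    drop = set(idx)
    return [MASK if i in drop else t for i, t in enumerate(tokens)]


def comprehensiveness(model: Model, tokens: Sequence[str], idx: Sequence[int]) -> float:
    """p(full input) - p(input with rationale hidden); higher is better."""
    return model(tokens) - model(remove(tokens, idx))


def sufficiency(model: Model, tokens: Sequence[str], idx: Sequence[int]) -> float:
    """p(full input) - p(rationale only); lower is better."""
    return model(tokens) - model(keep_only(tokens, idx))


def gamed_model(tokens: Sequence[str]) -> float:
    """Hypothetical wrapper: unchanged on unmasked inputs, adversarial on masked ones."""
    masked = sum(t == MASK for t in tokens)
    p = base_model(tokens)
    if masked == 0:
        return p                      # in-distribution input: same output as base_model
    sharpened = 1.0 if p >= 0.5 else 0.0
    if masked > len(tokens) / 2:
        return sharpened              # looks like a rationale: be maximally confident
    return 1.0 - sharpened            # looks like a complement: flip the prediction


if __name__ == "__main__":
    tokens = "a great fun film with superb acting".split()
    rationale = [1, 5]  # indices of "great" and "superb" (hypothetical explanation)
    for name, model in [("base", base_model), ("gamed", gamed_model)]:
        print(name,
              "p(full) =", round(model(tokens), 2),
              "comprehensiveness =", round(comprehensiveness(model, tokens, rationale), 2),
              "sufficiency =", round(sufficiency(model, tokens, rationale), 2))
```

In this toy setting both models return the same probability on the full sentence and receive the same rationale, yet the wrapped model scores better on both metrics, because masked inputs are easy to tell apart from ordinary ones. The probability used here is P(positive), which coincides with the predicted class only because the example sentence is classified as positive.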
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Scientific Computing and Data Management
