Learning to Deceive with Attention-Based Explanations

Danish Pruthi; Mansi Gupta; Bhuwan Dhingra; Graham Neubig; Zachary C. Lipton

arXiv:1909.07913 · cs.CL · April 8, 2020

TL;DR

This paper demonstrates that attention mechanisms in neural models can be manipulated to produce deceptive explanations, raising concerns about their reliability for interpretability and fairness auditing.

Contribution

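The authors introduce a simple training method that reduces the attention weight assigned to designated impermissible tokens. The resulting attention-based explanations are misleading: the model still relies on those tokens to make predictions, yet loses little predictive accuracy.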

Findings

01 Manipulated attention masks can hide reliance on sensitive features.

02 Deceptive explanations can fool human evaluators.

03 Attention weights are unreliable indicators of model reasoning.

Abstract

Attention mechanisms are ubiquitous components in neural architectures applied to natural language processing. In addition to yielding gains in predictive accuracy, attention weights are often claimed to confer interpretability, purportedly useful both for providing insights to practitioners and for explaining why a model makes its decisions to stakeholders. We call the latter use of attention mechanisms into question by demonstrating a simple method for training models to produce deceptive attention masks. Our method diminishes the total weight assigned to designated impermissible tokens, even when the models can be shown to nevertheless rely on these features to drive predictions. Across multiple models and tasks, our approach manipulates attention weights while paying surprisingly little cost in accuracy. Through a human study, we show that our manipulated attention-based…
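
The idea of diminishing the total attention weight on impermissible tokens can be illustrated with a small sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the penalty form `-log(1 - mass)`, the helper names (`softmax`, `impermissible_mass`, `attention_penalty`, `total_loss`), and the `penalty_coeff` parameter are all hypothetical choices made here, since the text above does not specify the exact training objective.

```python
import math


def softmax(scores):
    """Turn raw attention scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def impermissible_mass(attention, mask):
    """Total attention weight on tokens flagged as impermissible (mask == 1)."""
    return sum(a * m for a, m in zip(attention, mask))


def attention_penalty(attention, mask, eps=1e-12):
    """Assumed penalty: grows as more attention lands on impermissible tokens.

    It is zero when no attention is placed on masked tokens and becomes
    large as the masked tokens absorb nearly all of the attention.
    """
    allowed = max(1.0 - impermissible_mass(attention, mask), eps)
    return -math.log(allowed)


def total_loss(task_loss, attention, mask, penalty_coeff=1.0):
    """Task loss plus the weighted attention penalty (hypothetical objective)."""
    return task_loss + penalty_coeff * attention_penalty(attention, mask)


if __name__ == "__main__":
    # Toy example: the first token is designated impermissible.
    attention = softmax([2.0, 0.5, 0.1])
    mask = [1, 0, 0]
    print("mass on impermissible tokens:", impermissible_mass(attention, mask))
    print("penalty:", attention_penalty(attention, mask))
    print("total loss:", total_loss(0.3, attention, mask, penalty_coeff=0.1))
```

Minimizing a combined objective of this kind pushes the attention weights away from the masked tokens. It does not, by itself, stop the model from using those tokens through other pathways, which is the gap between attention and actual reliance that the abstract describes.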
