Perturbing Inputs for Fragile Interpretations in Deep Natural Language Processing

Sanchit Sinha; Hanjie Chen; Arshdeep Sekhon; Yangfeng Ji; Yanjun Qi

arXiv:2108.04990 · cs.CL · September 16, 2021

TL;DR

This paper reveals that popular interpretability methods for NLP models can be easily manipulated through simple word perturbations, exposing their fragility and raising concerns about their reliability in critical applications.

Contribution

The study demonstrates how minimal word swaps can significantly alter explanations of NLP models without changing their predictions, exposing vulnerabilities in current interpretability techniques.

Findings

01. Rank correlation drops by more than 20% when fewer than 10% of the words are perturbed

02. Interpretability methods are highly sensitive to small input changes

03. Generated adversarial examples stay semantically close to the seed text and keep the same predicted label
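
The rank-correlation finding can be made concrete with a small sketch. This page does not say which rank statistic the paper uses, so the code below assumes Spearman's rho, one common choice, and compares word-importance scores position by position, which works because word swaps leave the text length unchanged. The function names and the example scores are illustrative, not taken from the paper or its repository.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks of a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length score lists with at least 2 entries")
    ra, rb = average_ranks(a), average_ranks(b)
    mean_a, mean_b = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        return float("nan")  # all scores tied: correlation is undefined
    return cov / (var_a * var_b) ** 0.5


# Made-up importance scores for a 4-word seed text and its perturbed version.
seed_scores = [0.9, 0.1, 0.4, 0.05]
perturbed_scores = [0.2, 0.1, 0.8, 0.05]
rho = spearman(seed_scores, perturbed_scores)
print(f"rank correlation: {rho:.2f}")
```

A value of 1.0 means both explanations order the words identically. In this toy pair the two most important words trade places, which lowers the correlation even though every other word keeps its rank.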

Abstract

Interpretability methods like Integrated Gradient and LIME are popular choices for explaining natural language model predictions with relative word importance scores. These interpretations need to be robust for trustworthy NLP applications in high-stake areas like medicine or finance. Our paper demonstrates how interpretations can be manipulated by making simple word perturbations on an input text. Via a small portion of word-level swaps, these adversarial perturbations aim to make the resulting text semantically and spatially similar to its seed input (therefore sharing similar interpretations). Simultaneously, the generated examples achieve the same prediction label as the seed yet are given a substantially different explanation by the interpretation methods. Our experiments generate fragile interpretations to attack two SOTA interpretation methods, across three popular Transformer…
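
The constraints described in the abstract (few word swaps, same predicted label, a different explanation) can be illustrated with a toy search loop. This is a minimal sketch under stated assumptions, not the authors' algorithm: it uses a plain greedy search, a hypothetical synonym table stands in for the semantic-similarity constraint, a Kendall-style pairwise agreement stands in for the rank metric, and `toy_predict` and `toy_explain` are made-up stand-ins for a real classifier and a real interpretation method such as Integrated Gradient or LIME.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def pairwise_agreement(a: Sequence[float], b: Sequence[float]) -> float:
    """Fraction of position pairs that both score lists rank in the same order."""
    pairs = [(i, j) for i in range(len(a)) for j in range(i + 1, len(a))]
    same = sum(1 for i, j in pairs if (a[i] - a[j]) * (b[i] - b[j]) > 0)
    return same / len(pairs)


def greedy_fragility_search(
    tokens: List[str],
    candidates: Dict[str, List[str]],        # hypothetical synonym table
    predict: Callable[[List[str]], str],      # stand-in for the model
    explain: Callable[[List[str]], List[float]],  # stand-in for the interpreter
    max_swap_ratio: float = 0.1,
) -> Tuple[List[str], float]:
    """Greedily swap words to lower explanation agreement at a fixed label."""
    seed_label = predict(tokens)
    seed_scores = explain(tokens)
    budget = max(1, int(max_swap_ratio * len(tokens)))  # at least one swap
    current = list(tokens)
    swapped = set()
    best_sim = pairwise_agreement(seed_scores, seed_scores)
    for _ in range(budget):
        best_move = None
        for i, word in enumerate(current):
            if i in swapped:
                continue  # each position is perturbed at most once
            for sub in candidates.get(word, []):
                trial = current[:i] + [sub] + current[i + 1:]
                if predict(trial) != seed_label:
                    continue  # reject swaps that change the prediction
                sim = pairwise_agreement(seed_scores, explain(trial))
                if sim < best_sim:
                    best_sim, best_move = sim, (i, sub)
        if best_move is None:
            break  # no remaining swap lowers the agreement further
        current[best_move[0]] = best_move[1]
        swapped.add(best_move[0])
    return current, best_sim


# Toy stand-ins: a fixed per-word weight acts as both explanation and model.
TOY_WEIGHTS = {"the": 0.01, "movie": 0.1, "film": 0.6, "was": 0.02,
               "great": 0.9, "superb": 0.05, "awful": -2.0, "fun": 0.5}


def toy_explain(tokens: List[str]) -> List[float]:
    return [TOY_WEIGHTS.get(t, 0.0) for t in tokens]


def toy_predict(tokens: List[str]) -> str:
    return "pos" if sum(toy_explain(tokens)) > 0 else "neg"


seed = ["the", "movie", "was", "great", "fun"]
synonyms = {"movie": ["film"], "great": ["awful", "superb"]}
adversarial, agreement = greedy_fragility_search(
    seed, synonyms, toy_predict, toy_explain, max_swap_ratio=0.2)
print(adversarial, agreement)
```

The toy sentence has only five words, so the call raises the swap budget to 20% to allow a single swap. The candidate "awful" is rejected because it flips the toy label, while "superb" keeps the label and reorders the importance ranking the most, so it is the swap the search keeps.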

Code & Models

Repositories

qdata/textattack-fragile-interpretations (official)

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Adversarial Robustness in Machine Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dense Connections · Layer Normalization · Byte Pair Encoding · Label Smoothing · Residual Connection