On the Lack of Robust Interpretability of Neural Text Classifiers
Muhammad Bilal Zafar, Michele Donini, Dylan Slack, Cédric Archambeau, Sanjiv Das, Krishnaram Kenthapadi

TL;DR
This paper investigates the robustness of interpretability methods for neural text classifiers and finds that interpretations deviate significantly under randomization tests, which calls the reliability of current interpretability approaches into question.
Contribution
It introduces two randomization tests to evaluate the robustness of feature-based interpretability methods for Transformer-based models, highlighting their limitations.
Findings
Interpretations vary significantly between models with different initializations.
Interpretations differ markedly between trained and randomly initialized models.
Current interpretability methods may not provide reliable insights.
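One possible way to quantify how much interpretations vary between two models is to compare the top-ranked features of their attributions. The sketch below is only an illustration: the Jaccard overlap of the top-k features is an assumed comparison measure, not necessarily the one used in the paper, and the function names and attribution scores are hypothetical.

```python
from typing import Sequence, Set


def top_k_indices(scores: Sequence[float], k: int) -> Set[int]:
    """Return the indices of the k features with the largest absolute attribution."""
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return set(order[:k])


def top_k_jaccard(scores_a: Sequence[float], scores_b: Sequence[float], k: int) -> float:
    """Jaccard overlap between the top-k features of two attribution vectors.

    1.0 means both interpretations select the same top-k tokens,
    0.0 means they share none.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("attribution vectors must cover the same tokens")
    top_a = top_k_indices(scores_a, k)
    top_b = top_k_indices(scores_b, k)
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 1.0


# Hypothetical per-token attributions for one input, produced by two models
# that differ only in their random initialization (values are made up).
attr_seed_1 = [0.90, 0.10, 0.05, 0.70, 0.02]
attr_seed_2 = [0.05, 0.80, 0.10, 0.75, 0.01]

overlap = top_k_jaccard(attr_seed_1, attr_seed_2, k=2)
print(f"top-2 Jaccard overlap: {overlap:.2f}")
```

A low overlap between two models that reach similar accuracy would point to the kind of instability described in the findings. The same function could compare a trained model against a randomly initialized one.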
Abstract
With the ever-increasing complexity of neural language models, practitioners have turned to methods for understanding the predictions of these models. One of the most well-adopted approaches for model interpretability is feature-based interpretability, i.e., ranking the features in terms of their impact on model predictions. Several prior studies have focused on assessing the fidelity of feature-based interpretability methods, i.e., measuring the impact of dropping the top-ranked features on the model output. However, relatively little work has been conducted on quantifying the robustness of interpretations. In this work, we assess the robustness of interpretations of neural text classifiers, specifically, those based on pretrained Transformer encoders, using two randomization tests. The first compares the interpretations of two models that are identical except for their…
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Label Smoothing · Residual Connection · Dense Connections
