Understanding Neural Networks through Representation Erasure
Jiwei Li, Will Monroe, Dan Jurafsky

TL;DR
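This paper introduces a method for interpreting the decisions of neural NLP models: erase part of the representation (input word-vector dimensions, intermediate hidden units, or input words) and observe how the model's behavior changes. The resulting analysis supports both explanation of model decisions and error analysis.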
Contribution
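The paper presents a general approach to analyzing neural models through representation erasure, applicable across several NLP tasks. The analyses range from computing the relative difference in evaluation metrics to using reinforcement learning to find the minimum set of input words whose erasure flips a model's decision.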
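To make the minimal-erasure idea concrete, here is a small Python sketch that searches for a set of input words whose removal flips a classifier's decision. It is not the paper's method: the paper uses reinforcement learning, whereas this sketch swaps in a simple greedy search, which is not guaranteed to find the smallest set. The scorer `toy_margin`, its word weights, and the function `greedy_flip` are hypothetical names made up for illustration.

```python
from typing import Callable, List, Optional, Sequence


def toy_margin(words: Sequence[str]) -> float:
    """Hypothetical stand-in for a model: positive margin means 'positive' label."""
    weights = {"great": 2.0, "good": 1.0, "boring": -1.5}  # assumed toy weights
    return sum(weights.get(w, 0.0) for w in words)


def greedy_flip(
    words: Sequence[str], margin: Callable[[Sequence[str]], float]
) -> Optional[List[int]]:
    """Greedily erase words until the predicted label flips.

    Returns the erased word indices in erasure order, or None if the label
    never flips. Greedy search is an assumption of this sketch, not the
    reinforcement-learning procedure described in the paper.
    """
    original_positive = margin(words) > 0
    remaining = list(range(len(words)))
    erased: List[int] = []

    while remaining:
        def margin_without(i: int) -> float:
            return margin([words[j] for j in remaining if j != i])

        # Pick the erasure that pushes the margin furthest toward the other label.
        if original_positive:
            best = min(remaining, key=margin_without)
        else:
            best = max(remaining, key=margin_without)

        remaining.remove(best)
        erased.append(best)

        still_positive = margin([words[j] for j in remaining]) > 0
        if still_positive != original_positive:
            return erased
    return None


sentence = ["good", "but", "boring", "great"]
flip_set = greedy_flip(sentence, toy_margin)
print([sentence[i] for i in flip_set] if flip_set is not None else None)
```

In this toy setting, a single erasure is enough to flip the label; with a real model, `margin` would wrap a forward pass, and the greedy result would only be an upper bound on the size of the minimal set.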
Findings
Effective in explaining neural decisions
Applicable to multiple NLP tasks
Aids in error analysis
Abstract
While neural networks have been successfully applied to many natural language processing tasks, they come at the cost of interpretability. In this paper, we propose a general methodology to analyze and interpret decisions from a neural model by observing the effects on the model of erasing various parts of the representation, such as input word-vector dimensions, intermediate hidden units, or input words. We present several approaches to analyzing the effects of such erasure, from computing the relative difference in evaluation metrics, to using reinforcement learning to erase the minimum set of input words in order to flip a neural model's decision. In a comprehensive analysis of multiple NLP tasks, including linguistic feature classification, sentence-level sentiment analysis, and document level sentiment aspect prediction, we show that the proposed methodology not only offers clear…
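As a rough illustration of the "relative difference" analysis mentioned in the abstract, the Python sketch below erases one input word at a time and records the relative drop in a model score. The scorer `toy_score`, its word weights, and the function name `erasure_importance` are assumptions made for this example; a real analysis would call a trained model and would typically average over many examples.

```python
import math
from typing import Callable, List, Sequence


def toy_score(words: Sequence[str]) -> float:
    """Hypothetical stand-in for a model's confidence in the positive class."""
    weights = {"great": 2.0, "good": 1.0, "boring": -1.5}  # assumed toy weights
    raw = sum(weights.get(w, 0.0) for w in words)
    return 1.0 / (1.0 + math.exp(-raw))  # sigmoid, always strictly positive


def erasure_importance(
    words: Sequence[str], score: Callable[[Sequence[str]], float]
) -> List[float]:
    """Relative drop in score when each word is erased in turn.

    importance[i] = (score(all words) - score(words without i)) / score(all words)
    Assumes score(all words) is nonzero.
    """
    base = score(words)
    importances = []
    for i in range(len(words)):
        erased = list(words[:i]) + list(words[i + 1:])
        importances.append((base - score(erased)) / base)
    return importances


sentence = ["a", "great", "movie"]
for word, imp in zip(sentence, erasure_importance(sentence, toy_score)):
    print(f"{word}: {imp:.3f}")
```

Words with a large positive value are the ones the score depends on most; a value of zero means erasing the word left the score unchanged. The same loop could erase word-vector dimensions or hidden units instead of words, given access to those parts of the model.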
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning
