Evaluating Attribution Methods using White-Box LSTMs
Yiding Hao

TL;DR
This paper introduces a framework for evaluating neural network interpretability methods using transparent white-box LSTM models, revealing that existing attribution methods often fail to produce accurate explanations even on well-understood models.
Contribution
The paper proposes an evaluation framework that applies attribution methods to manually constructed white-box networks, whose behavior is understood in advance, so that each method's output can be checked against the expected explanation.
Findings
All five attribution methods failed to produce the expected explanations.
The white-box LSTMs solved their tasks perfectly and transparently.
The evaluation framework can identify shortcomings of interpretability methods.
Abstract
Interpretability methods for neural networks are difficult to evaluate because we do not understand the black-box models typically used to test them. This paper proposes a framework in which interpretability methods are evaluated using manually constructed networks, which we call white-box networks, whose behavior is understood a priori. We evaluate five methods for producing attribution heatmaps by applying them to white-box LSTM classifiers for tasks based on formal languages. Although our white-box classifiers solve their tasks perfectly and transparently, we find that all five attribution methods fail to produce the expected model explanations.
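To make the idea of a white-box network concrete, the following is a minimal sketch and not the paper's construction. It assumes a hypothetical toy task (tracking how many more "a" tokens than "b" tokens a string contains), a single hand-wired LSTM-style cell with saturated gates, and occlusion as a simple stand-in attribution method. The names `run_counter`, `occlusion`, and the constant `BIG` are illustrative assumptions. Because the weights are set by hand, the expected heatmap is known in advance: each "a" should score about +1 and each "b" about -1.

```python
import numpy as np

VOCAB = {"a": 0, "b": 1}   # hypothetical two-symbol alphabet
BIG = 20.0                 # large constant used to saturate gates (assumption)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def one_hot(ch):
    v = np.zeros(len(VOCAB))
    v[VOCAB[ch]] = 1.0
    return v


def run_counter(xs):
    """Run a hand-wired single-unit LSTM-style cell over input vectors xs.

    The input and forget gates are held near 1 by a large bias, and the
    candidate value is about +1 for "a", -1 for "b", and 0 for a zero vector,
    so the final cell state is approximately (#a - #b).
    """
    w_g = np.array([BIG, -BIG])   # candidate weights, set manually
    c = 0.0
    for x in xs:
        i = sigmoid(BIG)          # input gate, approximately 1
        f = sigmoid(BIG)          # forget gate, approximately 1
        g = np.tanh(w_g @ x)      # candidate value
        c = f * c + i * g
    return c


def occlusion(s):
    """Score each token by how much the cell state drops when it is zeroed."""
    xs = [one_hot(ch) for ch in s]
    base = run_counter(xs)
    scores = []
    for t in range(len(xs)):
        occluded = list(xs)
        occluded[t] = np.zeros(len(VOCAB))   # replace token t with a zero vector
        scores.append(base - run_counter(occluded))
    return scores


if __name__ == "__main__":
    print(np.round(occlusion("aab"), 3))
```

In this toy setting the attribution matches the known mechanism, which is what makes a mismatch informative: when a method's heatmap disagrees with the behavior built into the weights, the disagreement can be attributed to the method rather than to an unknown model.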
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Machine Learning and Data Classification
Methods: Interpretability · Tanh Activation · Sigmoid Activation · Long Short-Term Memory
