Evaluating Attribution Methods using White-Box LSTMs

Yiding Hao

arXiv:2010.08606·cs.LG·October 20, 2020

TL;DR

This paper introduces a framework for evaluating neural network interpretability methods using transparent white-box LSTM models. All five attribution methods tested fail to produce the expected explanations, even though the behavior of these models is understood in advance.

Contribution

The paper proposes an evaluation framework in which attribution methods are applied to manually constructed white-box networks, so that the resulting heatmaps can be checked against model behavior that is known a priori.

Findings

01 All five attribution methods evaluated failed to produce the expected explanations.

02 The white-box LSTM classifiers solved their formal-language tasks perfectly and transparently.

03 The evaluation framework can identify shortcomings of interpretability methods.

Abstract

Interpretability methods for neural networks are difficult to evaluate because we do not understand the black-box models typically used to test them. This paper proposes a framework in which interpretability methods are evaluated using manually constructed networks, which we call white-box networks, whose behavior is understood a priori. We evaluate five methods for producing attribution heatmaps by applying them to white-box LSTM classifiers for tasks based on formal languages. Although our white-box classifiers solve their tasks perfectly and transparently, we find that all five attribution methods fail to produce the expected model explanations.
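
The idea of checking an attribution heatmap against a network whose behavior is known in advance can be sketched with a toy example. The code below is an illustrative assumption, not the paper's construction: it hand-sets a one-unit LSTM so that its cell state counts the number of "a" tokens minus the number of "b" tokens, applies a simple occlusion-style attribution (chosen here only for illustration, and not necessarily one of the five methods evaluated), and compares the result with the heatmap one would expect from the design. The task, the saturation constant `M`, and the function names `lstm_counter`, `occlusion`, and `expected` are all hypothetical.

```python
import math

M = 20.0  # assumed large constant that saturates the gates near 1


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_counter(tokens):
    """Run a hand-built one-unit LSTM and return its final hidden state."""
    c = 0.0
    h = 0.0
    for tok in tokens:
        # Input encoding: +1 for "a", -1 for "b", 0 for any other
        # symbol (used below to mark an occluded position).
        x = {"a": 1.0, "b": -1.0}.get(tok, 0.0)
        i = sigmoid(M)        # input gate, approximately 1
        f = sigmoid(M)        # forget gate, approximately 1
        o = sigmoid(M)        # output gate, approximately 1
        g = math.tanh(M * x)  # candidate: about +1, about -1, or 0
        c = f * c + i * g     # cell state is about #a - #b so far
        h = o * math.tanh(c)
    return h


def occlusion(tokens):
    """Score each position by the drop in output when it is blanked out."""
    base = lstm_counter(tokens)
    scores = []
    for k in range(len(tokens)):
        occluded = tokens[:k] + ["_"] + tokens[k + 1:]
        scores.append(base - lstm_counter(occluded))
    return scores


def expected(tokens):
    """Heatmap implied by the design: each "a" adds 1, each "b" subtracts 1."""
    return [1.0 if t == "a" else -1.0 for t in tokens]


tokens = list("aaab")
for tok, got, want in zip(tokens, occlusion(tokens), expected(tokens)):
    print(tok, round(got, 3), want)
```

In this toy setting the occlusion scores for "aaab" have the expected signs but not the expected magnitudes, because the final tanh compresses the count. Since the network was built by hand, the mismatch can be attributed to the attribution procedure rather than to unknown model behavior.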

Code & Models

Repositories

yidinghao/whitebox-lstm
PyTorch · Official

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Machine Learning and Data Classification

Methods: Interpretability · Tanh Activation · Sigmoid Activation · Long Short-Term Memory