Towards Robust Interpretability with Self-Explaining Neural Networks

David Alvarez-Melis; Tommi S. Jaakkola

arXiv:1806.07538 · cs.LG · December 5, 2018 · 419 cites

Open Access

TL;DR

This paper introduces a new framework for self-explaining neural networks that prioritize interpretability during training, ensuring explanations are explicit, faithful, and stable, and demonstrates its effectiveness on benchmark datasets.

Contribution

The paper builds self-explaining models in stages, progressively generalizing linear classifiers to more complex but architecturally explicit models, and enforces faithfulness and stability through regularization tailored to those models.

Findings

01 The proposed models are built to satisfy three criteria for explanations: explicitness, faithfulness, and stability.

02 Experimental results suggest that interpretability can be improved without sacrificing performance.

03 The framework shows promising results across various benchmark datasets.

Abstract

Most recent work on interpretability of complex machine learning models has focused on estimating a posteriori explanations for previously trained models around specific predictions. Self-explaining models where interpretability plays a key role already during learning have received much less attention. We propose three desiderata for explanations in general -- explicitness, faithfulness, and stability -- and show that existing methods do not satisfy them. In response, we design self-explaining models in stages, progressively generalizing linear classifiers to complex yet architecturally explicit models. Faithfulness and stability are enforced via regularization specifically tailored to such models. Experimental results across various benchmark datasets show that our framework offers a promising direction for reconciling model complexity and interpretability.
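
The abstract gives no formulas, so the following is only a minimal sketch of the idea of generalizing a linear classifier, under stated assumptions. It assumes the simplest generalization: a linear model whose coefficients depend on the input, f(x) = theta(x) . x, with a penalty that measures how far the true gradient of f is from the coefficients theta(x) reported as the explanation. The tanh layer, the function names, and the finite-difference gradient are illustrative choices, not the authors' implementation.

```python
import numpy as np


def theta(x, W, b):
    """Input-dependent coefficients (hypothetical choice: one tanh layer)."""
    return np.tanh(W @ x + b)


def f(x, W, b):
    """Locally linear prediction: coefficients theta(x) applied to the raw features."""
    return float(theta(x, W, b) @ x)


def numerical_grad(fn, x, eps=1e-5):
    """Central finite-difference gradient of a scalar function fn at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        g[i] = (fn(x + step) - fn(x - step)) / (2 * eps)
    return g


def stability_penalty(x, W, b):
    """Norm of the gap between the gradient of f and the reported coefficients.

    Assumed form of the regularizer: it is zero when theta(x) behaves like
    the coefficients of a true linear model around x.
    """
    grad = numerical_grad(lambda z: f(z, W, b), x)
    return float(np.linalg.norm(grad - theta(x, W, b)))


rng = np.random.default_rng(0)
x = rng.normal(size=4)
W = rng.normal(size=(4, 4)) * 0.1
b = rng.normal(size=4)

# Per-feature contributions; they sum to the prediction by construction.
relevances = theta(x, W, b) * x
prediction = f(x, W, b)

# With W = 0 the coefficients are constant, i.e. a plain linear model,
# so the penalty vanishes up to floating-point error.
penalty_const = stability_penalty(x, np.zeros((4, 4)), b)

# With input-dependent coefficients the penalty is generally nonzero.
penalty = stability_penalty(x, W, b)
```

In this sketch, explicitness comes from the additive form (each feature's contribution is theta_i(x) * x_i), while the penalty is one way to encourage the coefficients to stay faithful to how the prediction actually changes near x. A training loop would add the penalty, scaled by a weight, to the usual classification loss.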

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Machine Learning and Data Classification

Methods: Interpretability