Towards Robust Interpretability with Self-Explaining Neural Networks
David Alvarez-Melis, Tommi S. Jaakkola

TL;DR
This paper introduces a new framework for self-explaining neural networks that prioritize interpretability during training, ensuring explanations are explicit, faithful, and stable, and demonstrates its effectiveness on benchmark datasets.
Contribution
The paper proposes a staged approach to develop self-explaining models with tailored regularization, advancing interpretability during the learning process.
Findings
Models satisfy explicitness, faithfulness, and stability criteria.
Experimental results show improved interpretability without sacrificing performance.
Framework is effective across various benchmark datasets.
Abstract
Most recent work on interpretability of complex machine learning models has focused on estimating explanations for previously trained models around specific predictions. models where interpretability plays a key role already during learning have received much less attention. We propose three desiderata for explanations in general -- explicitness, faithfulness, and stability -- and show that existing methods do not satisfy them. In response, we design self-explaining models in stages, progressively generalizing linear classifiers to complex yet architecturally explicit models. Faithfulness and stability are enforced via regularization specifically tailored to such models. Experimental results across various benchmark datasets show that our framework offers a promising direction for reconciling model complexity and interpretability.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsExplainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Machine Learning and Data Classification
MethodsInterpretability
