Towards a Unified Framework for Evaluating Explanations
Juan D. Pinto, Luc Paquette

TL;DR
This paper reviews how interpretability is evaluated in ML and HCI, proposing a unified framework that emphasizes faithfulness, intelligibility, plausibility, and stability for explanations.
Contribution
It introduces a unified evaluation framework for interpretability, clarifying relationships between criteria and integrating perspectives from ML and HCI communities.
Findings
Identifies overlaps and misalignments in existing evaluation methods.
Proposes relationships between explanation criteria like faithfulness and intelligibility.
Illustrates the framework with examples from a neural network interpretability study.
Abstract
The challenge of creating interpretable models has been taken up by two main research communities: ML researchers primarily focused on lower-level explainability methods that suit the needs of engineers, and HCI researchers who have more heavily emphasized user-centered approaches often based on participatory design methods. This paper reviews how these communities have evaluated interpretability, identifying overlaps and semantic misalignments. We propose moving towards a unified framework of evaluation criteria and lay the groundwork for such a framework by articulating the relationships between existing criteria. We argue that explanations serve as mediators between models and stakeholders, whether for intrinsically interpretable models or opaque black-box models analyzed via post-hoc techniques. We further argue that useful explanations require both faithfulness and intelligibility.…
Taxonomy
Topics: Evaluation and Performance Assessment · Scientific Computing and Data Management
