Interpretability is in the Mind of the Beholder: A Causal Framework for Human-interpretable Representation Learning
Emanuele Marconato, Andrea Passerini, Stefano Teso

TL;DR
This paper introduces a causal framework for human-interpretable representation learning that models the human as an external observer, linking interpretability, alignment, and disentanglement through a formal information-theoretic approach.
Contribution
It proposes a mathematical framework for acquiring interpretable representations that explicitly incorporates the human perspective, unifying concepts like alignment, disentanglement, and concept leakage.
Findings
Defines a formal notion of alignment between machine and human concepts.
Links alignment to disentanglement and concept leakage.
Provides an information-theoretic reformulation of interpretability properties.
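As a rough illustration of the kind of information-theoretic quantity such a reformulation can draw on, the sketch below estimates mutual information between a learned concept and human-defined concepts from discrete samples. It is not the paper's definition of alignment or leakage. The function name `mutual_information`, the toy variables (`color`, `shape`, `z_clean`, `z_leaky`), and the choice of a plug-in estimator are all assumptions made for this example.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one sample")
    joint = Counter(zip(xs, ys))  # empirical joint counts
    px = Counter(xs)              # empirical marginal counts of X
    py = Counter(ys)              # empirical marginal counts of Y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


# Hypothetical toy data: two independent binary human concepts.
color = [0, 0, 1, 1, 0, 0, 1, 1]
shape = [0, 1, 0, 1, 0, 1, 0, 1]

# A machine concept meant to capture color only (assumption for illustration).
z_clean = list(color)
# A machine concept that also encodes shape, i.e. information it should not carry.
z_leaky = [2 * c + s for c, s in zip(color, shape)]

print(mutual_information(z_clean, color))  # information about the intended concept
print(mutual_information(z_clean, shape))  # information about the unintended concept
print(mutual_information(z_leaky, shape))  # unintended information in the leaky concept
```

In this toy setup, both machine concepts are fully informative about `color`, but only `z_leaky` also carries information about `shape`. A plug-in estimator like this is biased on small samples and applies only to discrete variables, so it is best read as a diagnostic sketch.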
Abstract
Focus in Explainable AI is shifting from explanations defined in terms of low-level elements, such as input features, to explanations encoded in terms of interpretable concepts learned from data. How to reliably acquire such concepts is, however, still fundamentally unclear. An agreed-upon notion of concept interpretability is missing, with the result that concepts used by both post-hoc explainers and concept-based neural networks are acquired through a variety of mutually incompatible strategies. Critically, most of these neglect the human side of the problem: a representation is understandable only insofar as it can be understood by the human at the receiving end. The key challenge in Human-interpretable Representation Learning (HRL) is how to model and operationalize this human element. In this work, we propose a mathematical framework for acquiring interpretable representations…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Machine Learning in Healthcare
