From Black-box to Causal-box: Towards Building More Interpretable Models
Inwoo Hwang, Yushu Pan, Elias Bareinboim

TL;DR
This paper introduces the concept of causal interpretability, analyzes the limitations of existing model classes, and proposes a framework for designing models that can answer counterfactual questions while balancing interpretability and accuracy.
Contribution
It formalizes causal interpretability, derives a complete graphical criterion for determining whether a given model architecture supports a given counterfactual query, and characterizes the tradeoff between interpretability and predictive power.
Findings
Blackbox and concept-based models are not causally interpretable in general.
A framework for designing causally interpretable models is developed.
Experiments validate the theoretical tradeoff between interpretability and accuracy.
Abstract
Understanding the predictions made by deep learning models remains a central challenge, especially in high-stakes applications. A promising approach is to equip models with the ability to answer counterfactual questions -- hypothetical "what if?" scenarios that go beyond the observed data and provide insight into a model's reasoning. In this work, we introduce the notion of causal interpretability, which formalizes when counterfactual queries can be evaluated from a specific class of models and observational data. We analyze two common model classes -- blackbox and concept-based predictors -- and show that neither is causally interpretable in general. To address this gap, we develop a framework for building models that are causally interpretable by design. Specifically, we derive a complete graphical criterion that determines whether a given model architecture supports a given…
