Causal Proxy Models for Concept-Based Model Explanations
Zhengxuan Wu, Karel D'Oosterlinck, Atticus Geiger, Amir Zur, and Christopher Potts

TL;DR
This paper introduces Causal Proxy Models (CPMs), which use approximate counterfactuals to provide causal explanations for NLP models, enabling better interpretability without requiring true counterfactual data.
Contribution
The paper proposes CPMs, which are trained to mimic a black-box model's input/output behavior while exposing internal representations that can be intervened upon to simulate counterfactual behavior.
Findings
CPMs can replicate the input/output behavior of black-box models.
CPMs enable counterfactual interventions for model explanations.
CPMs perform comparably to original models in factual predictions.
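To make the first two findings concrete, here is a minimal sketch of an intervention on a hidden representation. It is not the authors' implementation: the toy network, its random weights, the dimensions, and the choice of hidden units 0-1 as the concept-aligned slice are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy proxy model: one hidden layer with fixed random weights.
W1 = rng.normal(size=(4, 6))  # input dim 4 -> hidden dim 6
W2 = rng.normal(size=(6, 3))  # hidden dim 6 -> 3 output scores


def hidden(x):
    """Hidden representation of an input vector."""
    return np.tanh(x @ W1)


def readout(h):
    """Map a hidden representation to output scores."""
    return h @ W2


def predict(x):
    """Factual (unintervened) forward pass."""
    return readout(hidden(x))


# Assumption: hidden units 0-1 are the slice aligned with one concept.
CONCEPT_SLICE = slice(0, 2)


def intervene(base_x, source_x, concept_slice=CONCEPT_SLICE):
    """Run the base input, but overwrite the concept slice of its hidden
    state with the values computed from the source input."""
    h = hidden(base_x).copy()
    h[concept_slice] = hidden(source_x)[concept_slice]
    return readout(h)


base = rng.normal(size=4)    # stands in for the original text
source = rng.normal(size=4)  # stands in for an approximate counterfactual

factual = predict(base)
counterfactual = intervene(base, source)
effect = counterfactual - factual  # estimated effect of changing the concept
```

In this sketch, intervening with the base input as its own source leaves the prediction unchanged, which is a quick sanity check that the swap touches only the chosen slice.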
Abstract
Explainability methods for NLP systems encounter a version of the fundamental problem of causal inference: for a given ground-truth input text, we never truly observe the counterfactual texts necessary for isolating the causal effects of model representations on outputs. In response, many explainability methods make no use of counterfactual texts, assuming they will be unavailable. In this paper, we show that robust causal explainability methods can be created using approximate counterfactuals, which can be written by humans to approximate a specific counterfactual or simply sampled using metadata-guided heuristics. The core of our proposal is the Causal Proxy Model (CPM). A CPM explains a black-box model N because it is trained to have the same actual input/output behavior as N while creating neural representations that can be intervened upon to simulate the…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Bayesian Modeling and Causal Inference
