A causal framework for explaining the predictions of black-box sequence-to-sequence models
David Alvarez-Melis, Tommi S. Jaakkola

TL;DR
This paper introduces a causal framework for interpreting black-box sequence-to-sequence models. It uses perturbation-based analysis to identify groups of input and output tokens that are causally related, and it is tested on several NLP sequence generation tasks.
Contribution
The paper presents a causal explanation method for black-box models. The method perturbs the input, builds a graph over tokens from the model's responses, and partitions that graph to select the most relevant token dependencies in sequence-to-sequence predictions.
Findings
The method is tested across several NLP sequence generation tasks
Explanations take the form of groups of causally related input-output tokens
The general approach applies to any black-box structured input-structured output model
Abstract
We interpret the predictions of any black-box structured input-structured output model around a specific input-output pair. Our method returns an "explanation" consisting of groups of input-output tokens that are causally related. These dependencies are inferred by querying the black-box model with perturbed inputs, generating a graph over tokens from the responses, and solving a partitioning problem to select the most relevant components. We focus the general approach on sequence-to-sequence problems, adopting a variational autoencoder to yield meaningful input perturbations. We test our method across several NLP sequence generation tasks.
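The perturb, query, graph, and partition steps described above can be illustrated with a small Python sketch. It is not the authors' implementation, and several pieces are simplifying assumptions. Random token dropping stands in for the variational autoencoder perturbations. A difference of conditional frequencies stands in for the paper's dependency estimates. Thresholding followed by connected components stands in for the graph partitioning step. The names `toy_black_box`, `perturb`, `dependency_graph`, and `explain` are hypothetical, and tokens are assumed to be unique within the sentence.

```python
import random
from collections import defaultdict


def toy_black_box(tokens):
    """Hypothetical stand-in for a black-box model: word-by-word lookup."""
    lexicon = {"the": "le", "cat": "chat", "sleeps": "dort"}
    return [lexicon[t] for t in tokens if t in lexicon]


def perturb(tokens, rng, drop_prob=0.3):
    """Randomly drop tokens (a crude substitute for VAE-based perturbations)."""
    return [t for t in tokens if rng.random() >= drop_prob]


def dependency_graph(tokens, model, n_samples=500, seed=0):
    """Weight each (input, output) edge by
    P(output appears | input kept) - P(output appears | input dropped)."""
    rng = random.Random(seed)
    out_tokens = model(tokens)
    kept = {t: 0 for t in tokens}
    hit_kept = defaultdict(int)     # output seen while input token was kept
    hit_dropped = defaultdict(int)  # output seen while input token was dropped
    for _ in range(n_samples):
        sample = perturb(tokens, rng)
        sample_set = set(sample)
        produced = set(model(sample))  # query the black box
        for t in tokens:
            if t in sample_set:
                kept[t] += 1
            for o in out_tokens:
                if o in produced:
                    if t in sample_set:
                        hit_kept[(t, o)] += 1
                    else:
                        hit_dropped[(t, o)] += 1
    weights = {}
    for t in tokens:
        dropped = n_samples - kept[t]
        if kept[t] == 0 or dropped == 0:
            continue  # no contrast available for this token
        for o in out_tokens:
            weights[(t, o)] = hit_kept[(t, o)] / kept[t] - hit_dropped[(t, o)] / dropped
    return weights


def explain(tokens, model, threshold=0.5, **kwargs):
    """Keep strong edges and merge them into connected input-output groups
    (a simple substitute for the partitioning step)."""
    weights = dependency_graph(tokens, model, **kwargs)
    edges = [(i, o) for (i, o), w in weights.items() if w >= threshold]
    groups = []  # each group is (set of input tokens, set of output tokens)
    for i, o in edges:
        merged = ({i}, {o})
        for g in [g for g in groups if i in g[0] or o in g[1]]:
            merged[0].update(g[0])
            merged[1].update(g[1])
            groups.remove(g)
        groups.append(merged)
    return groups


if __name__ == "__main__":
    for inputs, outputs in explain(["the", "cat", "sleeps"], toy_black_box):
        print(sorted(inputs), "->", sorted(outputs))
```

Because the toy model maps each word independently, every input token should end up grouped with exactly its own output token. A real sequence-to-sequence model would produce noisier edge weights, so the threshold and the grouping rule would need more care than this sketch gives them.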
