Causal Abstractions of Neural Networks
Atticus Geiger, Hanson Lu, Thomas Icard, Christopher Potts

TL;DR
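This paper introduces a formal causal abstraction framework for analyzing the internal representations of neural networks. Neural representations are aligned with variables in an interpretable causal model, and the alignment is verified with interchange interventions. A case study on natural language inference models shows that a BERT-based model encodes parts of the task's compositional causal structure.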
Contribution
It presents a novel causal abstraction method for neural network analysis that aligns internal representations with variables in interpretable causal models and verifies, through interchange interventions, that those representations have the causal properties of their aligned variables.
Findings
A BERT-based model trained on MQNLI encodes parts of the natural logic causal structure used to construct the dataset.
Simpler models do not exhibit the same causal structure.
The method provides rich characterizations of neural representations.
Abstract
Structural analysis methods (e.g., probing and feature attribution) are increasingly important tools for neural network analysis. We propose a new structural analysis method grounded in a formal theory of causal abstraction that provides rich characterizations of model-internal representations and their roles in input/output behavior. In this method, neural representations are aligned with variables in interpretable causal models, and then interchange interventions are used to experimentally verify that the neural representations have the causal properties of their aligned variables. We apply this method in a case study to analyze neural models trained on the Multiply Quantified Natural Language Inference (MQNLI) corpus, a highly complex NLI dataset that was constructed with a tree-structured natural logic causal model. We discover that a BERT-based model with state-of-the-art performance…
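The interchange-intervention check described in the abstract can be illustrated with a minimal sketch. Everything here is a hypothetical toy and not the authors' code or the MQNLI setup: the high-level causal model computes S = a + b and then S + c, and a hand-written function with a two-slot hidden state stands in for the neural network. The sketch swaps one hidden slot from a "source" run into a "base" run and checks whether the output matches the high-level model when S is set to its source value.

```python
from itertools import product

# Hypothetical high-level causal model: S = a + b, output = S + c.
def high_level(inputs, s_override=None):
    a, b, c = inputs
    s = a + b if s_override is None else s_override
    return s + c

def high_level_s(inputs):
    a, b, _ = inputs
    return a + b

# Toy stand-in for a network: the hidden state is a 2-slot list.
def low_hidden(inputs):
    a, b, c = inputs
    return [a + b, c]

def low_output(hidden):
    return hidden[0] + hidden[1]

def interchange_low(base, source, index):
    # Run on base, but overwrite one hidden slot with its value from the source run.
    hidden = low_hidden(base)
    hidden[index] = low_hidden(source)[index]
    return low_output(hidden)

def interchange_high(base, source):
    # Same intervention on the causal model: set S to its value under source.
    return high_level(base, s_override=high_level_s(source))

def interchange_accuracy(index, values=(0, 1, 2)):
    # Fraction of (base, source) pairs where both interventions agree
    # when hidden slot `index` is aligned with the variable S.
    inputs = list(product(values, repeat=3))
    pairs = [(b, s) for b in inputs for s in inputs]
    hits = sum(interchange_low(b, s, index) == interchange_high(b, s)
               for b, s in pairs)
    return hits / len(pairs)

if __name__ == "__main__":
    print("slot 0 aligned with S:", interchange_accuracy(0))
    print("slot 1 aligned with S:", interchange_accuracy(1))
```

In this toy, aligning slot 0 with S yields perfect agreement, while aligning slot 1 does not, which is the kind of contrast the method uses to decide whether a representation plays the causal role of its aligned variable.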
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Advanced Graph Neural Networks
Methods: Linear Layer · Layer Normalization · Residual Connection · Softmax · Dense Connections · Linear Warmup With Linear Decay · WordPiece · Weight Decay · Attention Is All You Need · Attention Dropout
