Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias
Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Simas Sakenis, Jason Huang, Yaron Singer, Stuart Shieber

TL;DR
This paper introduces a causal mediation analysis framework to interpret neural NLP models, revealing how specific components contribute to gender bias in Transformer models.
Contribution
It applies causal mediation analysis to neural NLP, uncovering the causal roles of neurons and attention heads in gender bias propagation.
Findings
Gender bias is sparse: it is concentrated in a small number of network components.
Bias effects are synergistic and context-dependent.
Bias can be decomposed into direct and mediated effects.
Abstract
Common methods for interpreting neural models in natural language processing typically examine either their structure or their behavior, but not both. We propose a methodology grounded in the theory of causal mediation analysis for interpreting which parts of a model are causally implicated in its behavior. It enables us to analyze the mechanisms by which information flows from input to output through various model components, known as mediators. We apply this methodology to analyze gender bias in pre-trained Transformer language models. We study the role of individual neurons and attention heads in mediating gender bias across three datasets designed to gauge a model's sensitivity to gender bias. Our mediation analysis reveals that gender bias effects are (i) sparse, concentrated in a small part of the network; (ii) synergistic, amplified or repressed by different components; and (iii)…
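To make the idea of direct and mediated effects concrete, the following is a minimal sketch on a toy network, not the paper's Transformer setup. The weights, the `effects` helper, and the difference-based effect measure are all assumptions chosen for illustration. The sketch compares a "null" input with an intervened input and measures how much of the output change passes through one hidden neuron (the mediator) and how much bypasses it.

```python
import numpy as np

# Hypothetical toy model (illustrative only, not the paper's architecture):
# two inputs -> three tanh hidden neurons -> scalar output, plus a skip path.
W1 = np.array([[1.0, 0.5], [0.0, 2.0], [-1.0, 1.0]])  # input -> hidden
w2 = np.array([0.5, -1.0, 2.0])                        # hidden -> output
d = np.array([0.3, -0.2])                              # input -> output skip path


def hidden(x):
    """Hidden activations for input x."""
    return np.tanh(W1 @ x)


def output(x, h):
    """Scalar output given the input x and (possibly patched) hidden state h."""
    return float(w2 @ h + d @ x)


def effects(x_null, x_set, neuron):
    """Return (total, direct, indirect) effects of switching x_null -> x_set.

    Effects are measured as output differences, which is an assumption of
    this sketch; other measures (for example ratios) are possible.
    """
    h_null, h_set = hidden(x_null), hidden(x_set)
    y_null = output(x_null, h_null)

    # Total effect: change the input and let everything respond.
    total = output(x_set, h_set) - y_null

    # Indirect effect: keep the input at x_null, but set the chosen neuron
    # to the value it would take under x_set.
    h_patched = h_null.copy()
    h_patched[neuron] = h_set[neuron]
    indirect = output(x_null, h_patched) - y_null

    # Direct effect: change the input, but hold the chosen neuron at the
    # value it takes under x_null.
    h_held = h_set.copy()
    h_held[neuron] = h_null[neuron]
    direct = output(x_set, h_held) - y_null

    return total, direct, indirect


x_null = np.array([0.0, 1.0])  # stand-in for the unintervened input
x_set = np.array([1.0, 1.0])   # stand-in for the intervened input
total, direct, indirect = effects(x_null, x_set, neuron=0)
print(f"total={total:.4f} direct={direct:.4f} indirect={indirect:.4f}")
```

In this toy model the output is a linear readout of the hidden neurons, so the direct and indirect effects add up exactly to the total effect. That exact additivity is a property of the sketch and should not be assumed for a deep nonlinear model.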
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
