Removing Spurious Correlation from Neural Network Interpretations
Milad Fotouhi, Mohammad Taha Bahadori, Oluwaseyi Feyisetan, Payman Arabshahi, David Heckerman

TL;DR
This paper introduces a causal mediation approach to improve neural network interpretability by controlling for confounders like conversation topic, reducing spurious correlations in identifying neurons responsible for harmful behaviors.
Contribution
It presents a novel causal mediation method that accounts for confounders, enhancing the accuracy of neural network interpretation in the presence of spurious correlations.
Findings
Adjusting for conversation topic reduces toxicity localization.
Confounders can create misleading neuron attribution.
Proposed method improves interpretability accuracy.
Abstract
Existing algorithms for identifying the neurons responsible for undesired and harmful behaviors do not consider the effects of confounders such as the topic of the conversation. In this work, we show that confounders can create spurious correlations and propose a new causal mediation approach that controls for the impact of the topic. In experiments with two large language models, we study the localization hypothesis and show that, after adjusting for the effect of conversation topic, toxicity becomes less localized.
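To make the confounding argument concrete, here is a minimal synthetic sketch. It is not the paper's mediation method. It assumes a hypothetical binary topic variable that drives both a neuron's activation and the toxicity score, with no direct link between the two, and then compares the raw correlation with the correlation after removing per-topic means. All names and constants (`topic`, `activation`, `toxicity`, the effect size 2.0) are illustrative assumptions.

```python
import numpy as np


def marginal_and_adjusted_corr(n=20000, seed=0):
    """Compare raw and topic-adjusted correlation on synthetic data.

    Assumed toy model (not from the paper): topic -> activation and
    topic -> toxicity, with no activation -> toxicity effect.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical binary confounder, e.g. two conversation topics.
    topic = rng.integers(0, 2, size=n)

    # Both quantities depend on the topic plus independent noise.
    activation = 2.0 * topic + rng.normal(size=n)
    toxicity = 2.0 * topic + rng.normal(size=n)

    # Raw correlation, ignoring the confounder.
    marginal = np.corrcoef(activation, toxicity)[0, 1]

    # Adjust by stratifying on topic: subtract each topic's mean,
    # then correlate the residuals.
    act_res = activation.copy()
    tox_res = toxicity.copy()
    for t in (0, 1):
        mask = topic == t
        act_res[mask] -= activation[mask].mean()
        tox_res[mask] -= toxicity[mask].mean()
    adjusted = np.corrcoef(act_res, tox_res)[0, 1]

    return marginal, adjusted


if __name__ == "__main__":
    marginal, adjusted = marginal_and_adjusted_corr()
    print(f"marginal correlation: {marginal:.3f}")
    print(f"topic-adjusted correlation: {adjusted:.3f}")
```

Under this toy model the raw correlation should be clearly positive (about 0.5 in expectation), while the topic-adjusted correlation should be close to zero. A neuron like this would look "responsible" for toxicity only because both track the topic, which is the kind of spurious attribution the abstract describes.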
Taxonomy
Topics: Topic Modeling
