Removing Spurious Correlation from Neural Network Interpretations

Milad Fotouhi; Mohammad Taha Bahadori; Oluwaseyi Feyisetan; Payman Arabshahi; David Heckerman

arXiv:2412.02893·cs.CL·December 5, 2024

Open Access

TL;DR

This paper introduces a causal mediation approach to improve neural network interpretability by controlling for confounders like conversation topic, reducing spurious correlations in identifying neurons responsible for harmful behaviors.

Contribution

It presents a novel causal mediation method that accounts for confounders, enhancing the accuracy of neural network interpretation in the presence of spurious correlations.

Findings

01

After adjusting for conversation topic, toxicity appears less localized.

02

Confounders can create misleading neuron attribution.

03

Proposed method improves interpretability accuracy.

Abstract

Existing algorithms for identifying neurons responsible for undesired and harmful behaviors do not consider the effects of confounders such as the topic of the conversation. In this work, we show that confounders can create spurious correlations and propose a new causal mediation approach that controls for the impact of the topic. In experiments with two large language models, we study the localization hypothesis and show that, after adjusting for the effect of conversation topic, toxicity becomes less localized.
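
To illustrate how a confounder can produce a spurious neuron attribution, here is a minimal toy simulation. It is not the paper's causal mediation method. It is a sketch that assumes a binary topic variable, a single synthetic "neuron activation", and a synthetic toxicity score, where the topic drives both quantities and the activation has no effect on toxicity. The names `simulate`, `naive_corr`, and `topic_adjusted_corr` are hypothetical.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=4000, seed=0):
    """Synthetic data (assumption): topic shifts both activation and toxicity.

    The activation never enters the toxicity equation, so any association
    between the two is produced by the shared topic alone.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        topic = rng.randint(0, 1)                   # confounder
        activation = 2.0 * topic + rng.gauss(0, 1)  # depends on topic only
        toxicity = 2.0 * topic + rng.gauss(0, 1)    # depends on topic only
        rows.append((topic, activation, toxicity))
    return rows


def naive_corr(rows):
    """Correlation that ignores the topic."""
    return pearson([a for _, a, _ in rows], [y for _, _, y in rows])


def topic_adjusted_corr(rows):
    """Size-weighted average of the within-topic correlations (stratification)."""
    total = len(rows)
    adjusted = 0.0
    for topic in sorted({t for t, _, _ in rows}):
        group = [(a, y) for t, a, y in rows if t == topic]
        weight = len(group) / total
        adjusted += weight * pearson([a for a, _ in group], [y for _, y in group])
    return adjusted


if __name__ == "__main__":
    data = simulate()
    print("naive correlation:         ", round(naive_corr(data), 3))
    print("topic-adjusted correlation:", round(topic_adjusted_corr(data), 3))
```

In this toy setup the naive correlation is clearly positive while the within-topic correlation stays near zero, which is the kind of spurious association the abstract warns about. Stratifying by topic is only the simplest possible adjustment and stands in here for the more involved mediation analysis the paper proposes.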

Taxonomy

Topics: Topic Modeling