The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Pre-trained Language Models

Yan Liu; Yu Liu; Xiaokang Chen; Pin-Yu Chen; Daoguang Zan; Min-Yen Kan; Tsung-Yi Ho

arXiv:2406.10130·cs.CL·June 17, 2024·1 cites

Open Access · 1 Repo

TL;DR

This paper introduces a novel interpretability method called IG$^2$ to identify social bias neurons in language models and proposes Bias Neuron Suppression to mitigate biases while preserving model performance.

Contribution

It presents a new neuron attribution technique for social bias detection and a neuron suppression method for bias mitigation in pre-trained language models.

Findings

01

IG$^2$ accurately locates social bias neurons in PLMs.

02

Bias Neuron Suppression reduces social biases effectively.

03

Models maintain language ability with low mitigation cost.

Abstract

Pre-trained language models (PLMs) have been acknowledged to contain harmful information, such as social biases, which may cause negative social impacts or even bring catastrophic results in application. Previous works on this problem mainly focused on using black-box methods such as probing to detect and quantify social biases in PLMs by observing model outputs. As a result, previous debiasing methods mainly finetune or even pre-train language models on newly constructed anti-stereotypical datasets, which are high-cost. In this work, we try to unveil the mystery of social bias inside language models by introducing the concept of Social Bias Neurons. Specifically, we propose Integrated Gap Gradients (IG$^2$) to accurately pinpoint units (i.e., neurons) in a language model that can be attributed to undesirable behavior, such as social bias. By formalizing undesirable behavior…
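The abstract is cut off before the method is defined, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that IG$^2$ behaves like an integrated-gradients attribution applied to the gap between a model's output probabilities for two demographic groups, and that Bias Neuron Suppression zeroes the highest-scoring neurons. The toy model (`W`, `h`), the finite-difference gradient, and every function name here are hypothetical and chosen for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def gap(h, W):
    """Absolute gap between the probabilities of two demographic outputs."""
    p = softmax(W @ h)
    return abs(p[0] - p[1])

def grad_gap(h, W, eps=1e-5):
    """Central-difference gradient of the gap w.r.t. each hidden activation."""
    g = np.zeros_like(h)
    for j in range(h.size):
        step = np.zeros_like(h)
        step[j] = eps
        g[j] = (gap(h + step, W) - gap(h - step, W)) / (2 * eps)
    return g

def integrated_gap_gradients(h, W, steps=50):
    """Riemann-sum attribution: scale neuron j alone from 0 up to h[j]
    and accumulate the gradient of the gap along that path."""
    attr = np.zeros_like(h)
    for j in range(h.size):
        total = 0.0
        for k in range(1, steps + 1):
            scaled = h.copy()
            scaled[j] = (k / steps) * h[j]
            total += grad_gap(scaled, W)[j]
        attr[j] = h[j] * total / steps
    return attr

def suppress(h, attr, top_k=1):
    """Zero the top_k neurons with the largest absolute attribution."""
    idx = np.argsort(-np.abs(attr))[:top_k]
    out = h.copy()
    out[idx] = 0.0
    return out, idx

# Hypothetical toy model: neuron 0 pushes the two outputs apart,
# neurons 1 and 2 affect both outputs equally and so cannot change the gap.
W = np.array([[1.0, 0.2, 0.3],
              [-1.0, 0.2, 0.3]])
h = np.array([2.0, 0.5, 1.0])

attr = integrated_gap_gradients(h, W)
suppressed, idx = suppress(h, attr, top_k=1)
gap_before = gap(h, W)
gap_after = gap(suppressed, W)
```

In this toy setting the attribution concentrates on the one neuron that separates the two outputs, and zeroing it removes the gap while leaving the other activations untouched. A real PLM would need automatic differentiation and a principled choice of which layers and how many neurons to suppress; the listing does not specify either.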

Code & Models

Repositories

theNamek/Bias-Neurons
PyTorch · Official

Taxonomy

Topics: Topic Modeling

Methods: Attention Is All You Need · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dropout · Adam · Linear Layer · Dense Connections · Multi-Head Attention