Understanding and Mitigating Gender Bias in LLMs via Interpretable Neuron Editing

Zeping Yu; Sophia Ananiadou

arXiv:2501.14457·cs.CL·January 27, 2025

Open Access

TL;DR

This paper introduces the CommonWords dataset and an interpretable neuron editing method for understanding and reducing gender bias in large language models while preserving their capabilities.

Contribution

The paper contributes the CommonWords dataset, an analysis of the neuron circuits behind gender bias, and an interpretable neuron editing approach that outperforms existing methods.

Findings

01. Effective reduction of gender bias in five LLMs

02. Identification of specific gender and general neurons responsible for bias

03. Neuron editing preserves model capabilities while reducing bias

Abstract

Large language models (LLMs) often exhibit gender bias, posing challenges for their safe deployment. Existing methods to mitigate bias lack a comprehensive understanding of its mechanisms or compromise the model's core capabilities. To address these issues, we propose the CommonWords dataset to systematically evaluate gender bias in LLMs. Our analysis reveals pervasive bias across models and identifies specific neuron circuits, including gender neurons and general neurons, responsible for this behavior. Notably, editing even a small number of general neurons can disrupt the model's overall capabilities due to hierarchical neuron interactions. Based on these insights, we propose an interpretable neuron editing method that combines logit-based and causal-based strategies to selectively target biased neurons. Experiments on five LLMs demonstrate that our method effectively reduces gender…
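To make the distinction between logit-based and causal-based neuron scoring concrete, here is a minimal toy sketch in pure Python. It is not the authors' implementation. The two-layer model, its weights, the 2-dimensional residual stream, the threshold, and the rule "edit only neurons that score highly on both measures" are all assumptions invented for illustration. The logit score measures what a neuron writes directly toward the "he" vs "she" direction. The causal score measures how the bias changes when the neuron is zeroed and the model is re-run.

```python
# Toy illustration only: every weight, name, and threshold here is hypothetical.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Each neuron is a (key, value) pair reading from and writing to a 2-d residual
# stream. Dimension 0 stands in for a "gender" direction, dimension 1 for a
# "general" feature that later neurons depend on.
LAYERS = [
    [([0.0, 1.0], [0.0, 1.0])],                             # layer 0: general neuron
    [([0.0, 1.0], [0.5, 0.0]), ([0.0, 1.0], [0.0, 0.3])],   # layer 1: gender, general
]
U_HE, U_SHE = [1.0, 0.0], [-1.0, 0.0]                       # toy unembedding rows
GENDER_DIR = [h - s for h, s in zip(U_HE, U_SHE)]           # logit(he) - logit(she)

def run(x, ablated=frozenset()):
    """Forward pass; neurons listed in `ablated` have their activation zeroed."""
    resid, acts = list(x), {}
    for li, layer in enumerate(LAYERS):
        # All neurons in a layer read the residual before any of them writes.
        layer_acts = [
            0.0 if (li, ni) in ablated else max(0.0, dot(key, resid))
            for ni, (key, _) in enumerate(layer)
        ]
        for ni, (_, value) in enumerate(layer):
            acts[(li, ni)] = layer_acts[ni]
            resid = [r + layer_acts[ni] * v for r, v in zip(resid, value)]
    return resid, acts

def bias(x, ablated=frozenset()):
    """Gap between the 'he' and 'she' logits."""
    return dot(run(x, ablated)[0], GENDER_DIR)

def logit_score(x, neuron):
    """Direct contribution of one neuron's output to the logit gap."""
    li, ni = neuron
    return run(x)[1][neuron] * dot(LAYERS[li][ni][1], GENDER_DIR)

def causal_score(x, neuron):
    """Drop in the logit gap when the neuron is zeroed and the model re-run."""
    return bias(x) - bias(x, frozenset([neuron]))

def select_neurons(x, threshold=0.5):
    """Assumed rule: keep neurons that score highly on both measures."""
    neurons = [(li, ni) for li, layer in enumerate(LAYERS) for ni in range(len(layer))]
    return [n for n in neurons
            if logit_score(x, n) >= threshold and causal_score(x, n) >= threshold]

X = [0.0, 1.0]                       # toy input carrying only the general feature
selected = select_neurons(X)
edited_bias = bias(X, frozenset(selected))
```

In this toy setup, the layer-0 general neuron has no direct logit effect but a nonzero causal effect, because the layer-1 gender neuron reads what it writes. Zeroing it also shrinks the general feature, which loosely mirrors the abstract's point that editing general neurons can disrupt other capabilities through hierarchical interactions. Zeroing only the selected layer-1 gender neuron removes the logit gap and leaves the general feature unchanged.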

Taxonomy

Topics: Artificial Intelligence in Law · Law, AI, and Intellectual Property · Natural Language Processing Techniques