Local Contrastive Editing of Gender Stereotypes
Marlene Lutz, Rochelle Choenni, Markus Strohmaier, Anne Lauscher

TL;DR
This paper presents a method called local contrastive editing to precisely identify and modify small subsets of weights in language models that encode gender stereotypes, improving understanding and control of bias.
Contribution
The paper introduces local contrastive editing, a technique that localizes and edits a subset of a target model's weights in relation to a reference model, enabling targeted mitigation of gender bias.
Findings
Localizes a small subset (< 0.5%) of weights that encode gender bias.
Shows that editing this subset gives precise control over gender bias in the target model.
Advances understanding of how stereotypical bias manifests in the parameter space of LMs.
Abstract
Stereotypical bias encoded in language models (LMs) poses a threat to safe language technology, yet our understanding of how bias manifests in the parameters of LMs remains incomplete. We introduce local contrastive editing that enables the localization and editing of a subset of weights in a target model in relation to a reference model. We deploy this approach to identify and modify subsets of weights that are associated with gender stereotypes in LMs. Through a series of experiments, we demonstrate that local contrastive editing can precisely localize and control a small subset (< 0.5%) of weights that encode gender bias. Our work (i) advances our understanding of how stereotypical biases can manifest in the parameter space of LMs and (ii) opens up new avenues for developing parameter-efficient strategies for controlling model properties in a contrastive manner.
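The abstract describes two steps: localizing a subset of target-model weights in relation to a reference model, then editing that subset. The sketch below illustrates the general idea only. It assumes localization by largest absolute weight difference and editing by linear interpolation toward the reference; the abstract does not state which criteria the authors use, and the names `localize`, `edit`, `fraction`, and `alpha` are hypothetical, as are the toy weight values.

```python
from typing import List


def localize(target: List[float], reference: List[float], fraction: float) -> List[int]:
    """Return indices of the `fraction` of weights that differ most between models.

    Assumption: "differ most" means largest absolute difference.
    """
    if len(target) != len(reference):
        raise ValueError("models must have the same number of weights")
    k = max(1, int(len(target) * fraction))
    # Rank positions by absolute difference, largest first.
    ranked = sorted(
        range(len(target)),
        key=lambda i: abs(target[i] - reference[i]),
        reverse=True,
    )
    return sorted(ranked[:k])


def edit(target: List[float], reference: List[float],
         indices: List[int], alpha: float) -> List[float]:
    """Interpolate only the selected target weights toward the reference.

    alpha = 0 leaves the target unchanged; alpha = 1 copies the reference
    value at the selected positions. All other weights are untouched.
    """
    edited = list(target)
    for i in indices:
        edited[i] = target[i] + alpha * (reference[i] - target[i])
    return edited


# Toy flattened weight vectors (made-up values, not from the paper).
target = [0.10, -0.50, 0.30, 0.80, -0.20, 0.05, 0.40, -0.90, 0.60, 0.00]
reference = [0.10, -0.45, 0.30, -0.70, -0.20, 0.05, 0.42, -0.90, 0.60, 0.00]

idx = localize(target, reference, fraction=0.1)
edited = edit(target, reference, idx, alpha=1.0)
print("edited positions:", idx)
```

For a real model, `fraction` would be set near the scale reported in the abstract (for example 0.005 for 0.5%), and the lists would be replaced by the flattened parameters of the target and reference models.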
Taxonomy
Topics: Digital Rights Management and Security
