BiasGym: A Simple and Generalizable Framework for Analyzing and Removing Biases through Elicitation
Sekh Mainul Islam, Nadav Borenstein, Siddhesh Milind Pawar, Haeun Yu, Arnav Arora, Isabelle Augenstein

TL;DR
BiasGym is a versatile framework that injects, analyzes, and mitigates biases in large language models, enabling safer and more interpretable AI systems without sacrificing task performance.
Contribution
The paper introduces BiasGym, a framework that combines bias injection and analysis to mitigate biases in LLMs systematically. It applies to biases not seen during fine-tuning and preserves downstream task accuracy.
Findings
Reduces bias effectively for real-world stereotypes
Supports targeted debiasing without performance loss
Generalizes to biases not seen during fine-tuning
Abstract
Understanding biases and stereotypes encoded in the weights of Large Language Models (LLMs) is crucial for developing effective mitigation strategies. However, biased behaviour is often subtle and non-trivial to isolate, even when deliberately elicited, making systematic analysis and debiasing particularly challenging. To address this, we introduce BiasGym, a simple, cost-effective, and generalizable framework for reliably and safely injecting, analyzing, and mitigating conceptual associations of biases within LLMs. BiasGym consists of two components: BiasInject, which safely injects specific biases into the model via token-based fine-tuning while keeping the model frozen, and BiasScope, which leverages these injected signals to identify and reliably steer the components responsible for biased behavior. Our method enables consistent bias elicitation…
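The idea of token-based fine-tuning with a frozen model can be illustrated with a toy sketch. This is not the authors' implementation: the "model" here is a single frozen scoring vector, and the names `train_bias_token`, `association_score`, `frozen_head`, and `frozen_embeddings` are hypothetical. The sketch only shows the general pattern of learning one new embedding row by gradient descent while leaving every existing parameter untouched.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def association_score(row, frozen_head):
    """Probability that the frozen head assigns to the (toy) biased association."""
    return sigmoid(sum(w * x for w, x in zip(frozen_head, row)))


def train_bias_token(frozen_head, steps=200, lr=0.5, seed=0):
    """Learn a single new embedding row; frozen_head is read but never modified."""
    rng = random.Random(seed)
    new_row = [rng.uniform(-0.01, 0.01) for _ in frozen_head]
    for _ in range(steps):
        p = association_score(new_row, frozen_head)
        # Gradient of the loss -log(p) with respect to new_row is -(1 - p) * w.
        grad = [-(1.0 - p) * w for w in frozen_head]
        new_row = [x - lr * g for x, g in zip(new_row, grad)]
    return new_row


# Hypothetical frozen parameters standing in for a pretrained model.
frozen_head = [0.5, -1.0, 0.25]
frozen_embeddings = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]
head_before = list(frozen_head)

bias_row = train_bias_token(frozen_head)
# The new token is appended as an extra row; existing rows are left as they were.
extended_embeddings = frozen_embeddings + [bias_row]
```

Because only the appended row carries the injected association, dropping that row restores the original toy model exactly. Whether the same holds for the full framework depends on details the abstract does not give.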
