BiasGym: A Simple and Generalizable Framework for Analyzing and Removing Biases through Elicitation

Sekh Mainul Islam; Nadav Borenstein; Siddhesh Milind Pawar; Haeun Yu; Arnav Arora; Isabelle Augenstein

arXiv:2508.08855·cs.CL·February 3, 2026

TL;DR

BiasGym is a versatile framework that injects, analyzes, and mitigates biases in large language models, enabling safer and more interpretable AI systems without sacrificing task performance.

Contribution

The paper introduces BiasGym, a framework that combines bias injection with analysis to mitigate biases in LLMs systematically. It applies to unseen biases and maintains downstream task accuracy.

Findings

01

Effective bias reduction in real-world stereotypes

02

Supports targeted debiasing without performance loss

03

Generalizes to biases unseen during fine-tuning

Abstract

Understanding biases and stereotypes encoded in the weights of Large Language Models (LLMs) is crucial for developing effective mitigation strategies. However, biased behaviour is often subtle and non-trivial to isolate, even when deliberately elicited, making systematic analysis and debiasing particularly challenging. To address this, we introduce BiasGym, a simple, cost-effective, and generalizable framework for reliably and safely injecting, analyzing, and mitigating conceptual associations of biases within LLMs. BiasGym consists of two components: BiasInject, which safely injects specific biases into the model via token-based fine-tuning while keeping the model frozen, and BiasScope, which leverages these injected signals to identify and reliably steer the components responsible for biased behavior. Our method enables consistent bias elicitation…
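The idea of token-based fine-tuning with a frozen model, as the abstract describes for BiasInject, can be illustrated with a toy sketch. This is not the authors' implementation: a small fixed matrix stands in for the LLM, and the names `softmax`, `train_new_token_embedding`, `frozen_weights`, and all hyperparameters are hypothetical. It shows only the general pattern, in which the vector of one new token is the sole trainable parameter and every other weight is read but never updated.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_new_token_embedding(output_weights, target_id, dim,
                              steps=200, lr=0.5, seed=0):
    """Learn a single new token embedding by gradient descent.

    output_weights is a toy stand-in for the frozen model (an assumption):
    it is only read, never written. The embedding is the one trainable
    parameter, pushed toward predicting target_id under cross-entropy loss.
    """
    rng = random.Random(seed)
    emb = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    n_out = len(output_weights)
    for _ in range(steps):
        # Forward pass through the frozen weights.
        logits = [sum(w * e for w, e in zip(row, emb)) for row in output_weights]
        probs = softmax(logits)
        # Gradient of cross-entropy w.r.t. the logits: probs - one_hot(target).
        dlogits = [p - (1.0 if i == target_id else 0.0)
                   for i, p in enumerate(probs)]
        # Backpropagate to the embedding only: grad = W^T @ dlogits.
        grad = [sum(output_weights[i][j] * dlogits[i] for i in range(n_out))
                for j in range(dim)]
        emb = [e - lr * g for e, g in zip(emb, grad)]
    return emb


# Usage: three output classes, two-dimensional embeddings (toy values).
frozen_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
learned = train_new_token_embedding(frozen_weights, target_id=1, dim=2)
```

Because only `learned` changes, the associated behavior is tied to the new token rather than spread across the model's weights, which is what makes such an injected signal convenient to inspect afterwards.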
