Activation Steering for Bias Mitigation: An Interpretable Approach to Safer LLMs
Shivam Dubey

TL;DR
This paper introduces an interpretable activation steering method that detects and mitigates biases within large language models by manipulating internal activations, leading to safer and more accountable AI outputs.
Contribution
The work presents a novel end-to-end bias mitigation system using mechanistic interpretability to identify and actively steer away from biased content within LLMs.
Findings
Probes accurately detect bias in GPT-2 layers
Steering vectors effectively reduce stereotypical outputs
Bias mitigation is achieved in real time during inference
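The summary above does not say how the steering vectors are built. As a rough illustration only, the sketch below uses one common construction, a difference of class means, and subtracts a scaled copy of that direction from a hidden state at inference time. The function names, the alpha parameter, and the toy 3-dimensional activations are all hypothetical and are not taken from the paper.

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def steering_vector(biased_acts, neutral_acts):
    """Difference of class means: points from neutral toward biased activations."""
    mu_biased = mean_vector(biased_acts)
    mu_neutral = mean_vector(neutral_acts)
    return [b - n for b, n in zip(mu_biased, mu_neutral)]


def steer(hidden, direction, alpha=1.0):
    """Shift one hidden state away from the bias direction by alpha * direction."""
    return [h - alpha * d for h, d in zip(hidden, direction)]


def projection(hidden, direction):
    """Scalar projection of a hidden state onto the normalised direction."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(h * d for h, d in zip(hidden, direction)) / norm


# Toy activations (hypothetical values, not real model data).
biased = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
neutral = [[0.0, 0.0, 1.0], [0.0, 2.0, 1.0]]

v = steering_vector(biased, neutral)
h = [3.0, 0.0, 1.0]
h_steered = steer(h, v, alpha=1.0)

# The steered state should project less onto the bias direction than before.
before = projection(h, v)
after = projection(h_steered, v)
```

In a real model the same subtraction would be applied to the hidden states of a chosen layer during the forward pass; which layer and what scale to use are tuning choices this sketch does not address.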
Abstract
As large language models (LLMs) become more integrated into societal systems, the risk of them perpetuating and amplifying harmful biases becomes a critical safety concern. Traditional methods for mitigating bias often rely on data filtering or post-hoc output moderation, which treat the model as an opaque black box. In this work, we introduce a complete, end-to-end system that uses techniques from mechanistic interpretability to both identify and actively mitigate bias directly within a model's internal workings. Our method involves two primary stages. First, we train linear "probes" on the internal activations of a model to detect the latent representations of various biases (e.g., gender, race, age). Our experiments on gpt2-large demonstrate that these probes can identify biased content with near-perfect accuracy, revealing that bias representations become most salient in…
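To make the first stage concrete, here is a minimal sketch of a linear probe: a logistic-regression classifier trained on activation vectors to predict a binary "biased" label. It is an illustration under stated assumptions, not the paper's implementation. The activations are synthetic Gaussian vectors standing in for a layer's hidden states, and the function names, learning rate, and epoch count are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(acts, labels, lr=0.5, epochs=200):
    """Fit logistic-regression weights (w, b) by full-batch gradient descent."""
    dim = len(acts[0])
    n = len(acts)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(acts, labels):
            # Gradient of the log loss with respect to the logit is (p - y).
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 if the probe's logit is positive, else 0."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0.0)


# Synthetic stand-in for layer activations (assumption: not real model data).
# Label 1 vectors are shifted along dimension 0; other dimensions are noise.
rng = random.Random(0)
dim = 4
acts, labels = [], []
for label, centre in ((0, -2.0), (1, 2.0)):
    for _ in range(20):
        x = [rng.gauss(0.0, 0.5) for _ in range(dim)]
        x[0] += centre
        acts.append(x)
        labels.append(label)

w, b = train_probe(acts, labels)
accuracy = sum(predict(w, b, x) == y for x, y in zip(acts, labels)) / len(acts)
```

Training one such probe per layer and comparing held-out accuracy is a common way to see where a concept is most linearly readable. This toy example scores on its own training data, so its accuracy says nothing about the results reported for gpt2-large.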
Taxonomy
Topics: Topic Modeling · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)
