Activation Steering for Bias Mitigation: An Interpretable Approach to Safer LLMs

Shivam Dubey

arXiv:2508.09019·cs.AI·August 13, 2025

TL;DR

This paper introduces an interpretable activation steering method that detects bias in a large language model's internal activations and mitigates it by steering those activations at inference time, aiming for safer and more accountable outputs.

Contribution

The work presents a novel end-to-end bias mitigation system using mechanistic interpretability to identify and actively steer away from biased content within LLMs.

Findings

01 Linear probes detect biased content in gpt2-large activations with near-perfect accuracy

02 Steering vectors reduce stereotypical outputs

03 Bias mitigation is applied in real time during inference
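
To make the steering idea in the findings concrete, here is a minimal sketch on synthetic data. It assumes a difference-of-means steering vector, which is a common choice in the activation steering literature; this page does not say how the paper computes its vectors, so treat the construction, the function names (`steering_vector`, `steer`), and the strength `alpha` as illustrative assumptions rather than the author's method.

```python
import numpy as np

def steering_vector(biased_acts, neutral_acts):
    """Assumed construction: difference of class means, scaled to unit length."""
    v = biased_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(hidden, v, alpha=4.0):
    """Shift hidden states by alpha units against the bias direction v."""
    return hidden - alpha * v

# Synthetic stand-in for layer activations (not real gpt2-large data).
rng = np.random.default_rng(0)
d = 16
bias_dir = rng.normal(size=d)
bias_dir /= np.linalg.norm(bias_dir)

neutral = rng.normal(size=(200, d))
biased = rng.normal(size=(200, d)) + 4.0 * bias_dir

v = steering_vector(biased, neutral)
steered = steer(biased, v, alpha=4.0)

# Mean projection onto the bias direction before and after steering.
before = float((biased @ v).mean())
after = float((steered @ v).mean())
print(f"mean projection before: {before:.2f}, after: {after:.2f}")
```

In a real model the same shift would typically be applied inside a forward hook on the chosen layer during generation, which is what makes the intervention possible at inference time without retraining.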

Abstract

As large language models (LLMs) become more integrated into societal systems, the risk of them perpetuating and amplifying harmful biases becomes a critical safety concern. Traditional methods for mitigating bias often rely on data filtering or post-hoc output moderation, which treat the model as an opaque black box. In this work, we introduce a complete, end-to-end system that uses techniques from mechanistic interpretability to both identify and actively mitigate bias directly within a model's internal workings. Our method involves two primary stages. First, we train linear "probes" on the internal activations of a model to detect the latent representations of various biases (e.g., gender, race, age). Our experiments on gpt2-large demonstrate that these probes can identify biased content with near-perfect accuracy, revealing that bias representations become most salient in…
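
The first stage described in the abstract, training a linear probe on activations, can be sketched as follows. This is a minimal illustration under stated assumptions: the probe is a plain logistic regression trained by gradient descent, and the activations are synthetic vectors rather than real gpt2-large hidden states. The function names, dimensions, and hyperparameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def train_linear_probe(acts, labels, lr=0.5, steps=500):
    """Fit a logistic-regression probe: sigmoid(acts @ w + b) estimates P(biased)."""
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        logits = acts @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        err = probs - labels              # per-example gradient of cross-entropy w.r.t. logits
        w -= lr * (acts.T @ err) / n
        b -= lr * err.mean()
    return w, b

def probe_accuracy(acts, labels, w, b):
    """Fraction of examples whose predicted class matches the label."""
    preds = (acts @ w + b) > 0
    return float((preds == labels.astype(bool)).mean())

# Synthetic stand-in for one layer's activations: two classes separated
# along a single hidden direction (an assumption made for illustration).
rng = np.random.default_rng(0)
d = 16
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

labels = rng.integers(0, 2, size=400).astype(float)
acts = rng.normal(size=(400, d)) + 3.0 * np.outer(2 * labels - 1, direction)

# Train on the first 300 examples, evaluate on the held-out 100.
w, b = train_linear_probe(acts[:300], labels[:300])
acc = probe_accuracy(acts[300:], labels[300:], w, b)
print(f"held-out probe accuracy: {acc:.2f}")
```

Repeating this fit on activations from each layer and comparing held-out accuracy is one way to see at which depth a concept is most linearly decodable.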

Taxonomy

Topics: Topic Modeling · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)