Bi-directional Bias Attribution: Debiasing Large Language Models without Modifying Prompts

Yujie Lin; Kunquan Li; Yixuan Liao; Xiaoxin Chen; Jinsong Su

arXiv:2602.04398·cs.CL·February 5, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a novel bias mitigation framework for large language models that detects stereotype words and attributes bias to specific neurons, enabling bias reduction without fine-tuning or prompt changes.

Contribution

It presents a neuron-level bias attribution and intervention method that effectively reduces social biases in LLMs without altering their prompts or requiring additional training.

Findings

01

Reduces social bias in LLM outputs

02

Preserves overall model performance

03

Applicable across multiple LLMs

Abstract

Large language models (LLMs) have demonstrated impressive capabilities across a wide range of natural language processing tasks. However, their outputs often exhibit social biases, raising fairness concerns. Existing debiasing methods, such as fine-tuning on additional datasets or prompt engineering, face scalability issues or compromise user experience in multi-turn interactions. To address these challenges, we propose a framework for detecting stereotype-inducing words and attributing neuron-level bias in LLMs, without the need for fine-tuning or prompt modification. Our framework first identifies stereotype-inducing adjectives and nouns via comparative analysis across demographic groups. We then attribute biased behavior to specific neurons using two attribution strategies based on integrated gradients. Finally, we mitigate bias by directly intervening on their activations at the…
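The attribute-then-intervene idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy scoring function `score`, the weights `w`, the all-zero baseline, and the choice to zero out the top-attributed neuron are assumptions made purely for illustration, since the abstract is cut off before it says how activations are edited. The sketch approximates integrated gradients with a midpoint Riemann sum over the straight path from the baseline to the observed activations.

```python
import math

def score(h, w):
    """Hypothetical 'bias score': a smooth function of hidden activations h."""
    return sum(wi * math.tanh(hi) for wi, hi in zip(w, h))

def grad_score(h, w):
    """Analytic gradient of score with respect to each activation."""
    return [wi * (1.0 - math.tanh(hi) ** 2) for wi, hi in zip(w, h)]

def integrated_gradients(h, w, baseline=None, steps=200):
    """Approximate integrated gradients per neuron with a midpoint Riemann sum."""
    if baseline is None:
        baseline = [0.0] * len(h)  # assumed baseline: all-zero activations
    total = [0.0] * len(h)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight path from baseline to h.
        point = [b + alpha * (hi - b) for hi, b in zip(h, baseline)]
        total = [t + g for t, g in zip(total, grad_score(point, w))]
    # Scale the averaged gradient by the activation difference.
    return [(hi - b) * t / steps for hi, b, t in zip(h, baseline, total)]

def suppress_top_neurons(h, attributions, k=1):
    """Assumed intervention: zero the k neurons with largest |attribution|."""
    order = sorted(range(len(h)), key=lambda i: abs(attributions[i]), reverse=True)
    top = order[:k]
    edited = [0.0 if i in top else hi for i, hi in enumerate(h)]
    return edited, top

w = [2.0, -0.5, 0.1]   # hypothetical readout weights
h = [1.5, 0.3, -2.0]   # hypothetical hidden activations
attr = integrated_gradients(h, w)
h_edited, top = suppress_top_neurons(h, attr, k=1)
```

In this toy setup the attributions sum (up to discretization error) to `score(h, w) - score(baseline, w)`, the completeness property of integrated gradients, and zeroing the single top-attributed neuron shrinks the magnitude of the score. In a real open-weight LLM the gradients would come from automatic differentiation and the edit would be applied to activations during the forward pass.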

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

1. The method avoids the computational and maintenance costs of fine-tuning or prompt engineering, relying only on activation-level adjustments.
2. Integrated-gradient-based neuron attribution offers transparency and clear diagnostics of where biases emerge.
3. Experiments showcase the effectiveness of this method.

Weaknesses

BBA assumes access to hidden activations and gradients, which is feasible for open-source LLMs (e.g., LLaMA, Mistral) but not for closed-source systems such as GPT models. Although I do not think this is a very important limitation, I hope the authors can clearly position this study as designed for **open-source models**.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

With the widespread adoption of large language models, addressing their social impact—particularly bias—has become increasingly critical. This paper tackles this important issue by proposing a practical solution tailored for open-weight LLMs. In addition to empirical results, the paper offers theoretical analysis that sheds light on the underlying mechanisms of bias and its mitigation, which adds depth to the contribution.

Weaknesses

Since this paper focuses on social bias, rigorous and meaningful evaluation is both crucial and challenging. I have two main concerns in this regard. First, the evaluation of the bias-related word selection process lacks clarity and quantitative justification. Second, the methodology used for evaluating bias in the language models themselves needs further elaboration and validation. (See detailed comments in the Questions section.)

Reviewer 03 · Rating 4 · Confidence 5

Strengths

The paper considers an important problem: how to mitigate social biases in LLMs. Whereas much prior work uses prompt-based approaches or fine-tunes (or aligns) LLMs, this paper proposes a different approach in which a subset of neurons responsible for social biases is identified and then acted upon.

Weaknesses

- I do not understand why P(man | The doctor is likely a) is considered a stereotypical inference in Definition 2. For example, an image of a male doctor could indeed be shown to an LLM, and the correct prediction would be that it is a man. This does not cause any stereotypical bias against the disadvantaged group (i.e., females in this case).
- The definitions of the social bias types considered in the paper are not provided. For example, do you consider gender to be binary? This would affect how for e

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education