Towards Resource Efficient and Interpretable Bias Mitigation in Large Language Models

Schrasing Tong; Eliott Zemour; Jessica Lu; Rawisara Lohanimit; Lalana Kagal

arXiv:2412.01711 · cs.CL · March 9, 2026

TL;DR

This paper introduces a resource-efficient, interpretable bias mitigation method for large language models: small biased and anti-biased expert models produce a debiasing signal that is added to the LLM output at decoding time, reducing gender, race, and religion biases.

Contribution

The paper proposes a bias mitigation approach that uses small expert models for decoding-time bias correction. It is more efficient than re-training a large model, since only small models are fine-tuned, and more interpretable, since the probability shift from debiasing can be examined.

Findings

01. Reduces gender, race, and religion biases in LLMs

02. Maintains language model performance after bias mitigation

03. Applicable across different architectures and bias types

Abstract

Although large language models (LLMs) have demonstrated their effectiveness in a wide range of applications, they have also been observed to perpetuate unwanted biases present in the training data, potentially leading to harm for marginalized communities. In this paper, we mitigate bias by leveraging small biased and anti-biased expert models to obtain a debiasing signal that is added to the LLM output at decoding time. This approach combines computational efficiency (fine-tuning a small model rather than re-training a large one) with interpretability (one can examine the probability shift from debiasing). The framework can also be tailored to specific contexts by changing the fine-tuning dataset. Experiments on mitigating gender, race, and religion biases on different architectures show a reduction in bias on several local and global bias metrics while preserving language…
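The decoding-time idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything here is an assumption made for illustration: the signal is taken to be the difference between the anti-biased and biased expert logits, it is added to the LLM logits with a hypothetical scaling factor `alpha`, and the three-token vocabulary and logit values are made up. The function name `debias_step` is also hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def debias_step(base_logits, biased_logits, anti_logits, alpha=1.0):
    """One decoding step of an assumed expert-based debiasing scheme.

    ASSUMPTION: signal = anti-biased expert logits - biased expert logits,
    added to the LLM logits scaled by alpha. The paper may define the
    signal differently (for example, in probability space).
    """
    signal = [a - b for a, b in zip(anti_logits, biased_logits)]
    adjusted = [z + alpha * s for z, s in zip(base_logits, signal)]
    p_before = softmax(base_logits)
    p_after = softmax(adjusted)
    # Per-token probability shift: the quantity one can inspect
    # to see what the debiasing signal changed.
    shift = [after - before for before, after in zip(p_before, p_after)]
    return p_after, shift

# Toy next-token candidates and made-up logits (illustrative only).
vocab = ["he", "she", "they"]
base = [2.0, 1.0, 0.5]     # large LLM
biased = [1.5, 0.2, 0.1]   # small expert fine-tuned on biased text
anti = [0.6, 0.9, 0.8]     # small expert fine-tuned on anti-biased text

probs, shift = debias_step(base, biased, anti, alpha=1.0)
for token, p, d in zip(vocab, probs, shift):
    print(f"{token:5s} p={p:.3f} shift={d:+.3f}")
```

In this toy setup, setting `alpha=0.0` recovers the original LLM distribution, and the `shift` list is one way to read off which tokens the experts pushed up or down at a given step.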

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling