BiasFilter: An Inference-Time Debiasing Framework for Large Language Models
Xiaoqing Cheng, Ruizhe Chen, Hongying Zan, Yuxiang Jia, and Min Peng

TL;DR
BiasFilter is an inference-time framework that reduces social bias in large language models by filtering outputs in real time, without retraining or modifying the underlying model.
Contribution
The paper introduces BiasFilter, a model-agnostic, inference-time debiasing method that filters LLM outputs based on a learned fairness reward and is designed to scale to large models and open-ended generation tasks.
Findings
Effectively reduces social bias across various LLMs
Preserves overall generation quality
Operates efficiently without retraining or model modification
Abstract
Mitigating social bias in large language models (LLMs) has become an increasingly important research objective. However, existing debiasing methods often incur high human and computational costs, exhibit limited effectiveness, and struggle to scale to larger models and open-ended generation tasks. To address these limitations, this paper proposes BiasFilter, a model-agnostic, inference-time debiasing framework that integrates seamlessly with both open-source and API-based LLMs. Instead of relying on retraining with balanced data or modifying model parameters, BiasFilter enforces fairness by filtering generation outputs in real time. Specifically, it periodically evaluates intermediate outputs every few tokens, maintains an active set of candidate continuations, and incrementally completes generation by discarding low-reward segments based on a fairness reward signal. To support this…
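The filtering loop described in the abstract (extend candidates by a short segment, score them, discard the low-reward ones, repeat) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `filtered_generate`, `extend`, `reward`, `keep`, and `max_segments` are hypothetical names, and the toy generator and toy reward below merely stand in for an actual LLM and a fairness reward model.

```python
from typing import Callable, List


def filtered_generate(
    prompt: str,
    extend: Callable[[str], List[str]],  # hypothetical: candidate next segments for a prefix
    reward: Callable[[str], float],      # hypothetical fairness reward, higher means fairer
    keep: int = 2,                       # size of the active candidate set (assumed)
    max_segments: int = 3,               # number of filtering rounds (assumed)
) -> str:
    """Grow a response segment by segment, pruning low-reward candidates."""
    active = [prompt]
    for _ in range(max_segments):
        # Extend every active candidate by one short segment ("every few tokens").
        expanded = [prefix + seg for prefix in active for seg in extend(prefix)]
        if not expanded:
            break
        # Score the partial outputs and keep only the highest-reward ones.
        expanded.sort(key=reward, reverse=True)
        active = expanded[:keep]
    # Return the best surviving candidate.
    return max(active, key=reward)


def toy_extend(prefix: str) -> List[str]:
    # Stand-in for sampling continuations from an LLM (assumption).
    return [" good", " bad"]


def toy_reward(text: str) -> float:
    # Stand-in for a fairness reward model: penalize each "bad" segment (assumption).
    return -float(text.count("bad"))


result = filtered_generate("x", toy_extend, toy_reward)
print(result)
```

Because the loop only needs candidate text and a scalar score, a scheme of this shape does not require access to model parameters, which is consistent with the abstract's claim that the framework works with both open-source and API-based LLMs.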
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
