Self-Debias: Self-correcting for Debiasing Large Language Models
Xuan Feng, Shuai Zhao, Luwei Xiao, Tianlong Gu, Bo An

TL;DR
Self-Debias is a novel framework that enables large language models to self-correct social biases during reasoning, using a resource redistribution approach and online self-improvement with minimal supervision.
Contribution
It introduces a progressive, self-correcting debiasing method that reallocates probability mass and employs consistency filtering, reducing reliance on external interventions.
Findings
Achieves superior debiasing with only 20k annotated samples.
Preserves reasoning capabilities while reducing biases.
Enables autonomous self-correction during reasoning processes.
Abstract
Although Large Language Models (LLMs) demonstrate remarkable reasoning capabilities, inherent social biases often cascade throughout the Chain-of-Thought (CoT) process, leading to continuous "Bias Propagation". Existing debiasing methods primarily focus on static constraints or external interventions, failing to identify and interrupt this propagation once triggered. To address this limitation, we introduce Self-Debias, a progressive framework designed to instill intrinsic self-correction capabilities. Specifically, we reformulate the debiasing process as a strategic resource redistribution problem, treating the model's output probability mass as a limited resource to be reallocated from biased heuristics to unbiased reasoning paths. Unlike standard preference optimization which applies broad penalties, Self-Debias employs a fine-grained trajectory-level objective subject to dynamic…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
