Compositional Bias Control in Large Language Models: Preference Learning Fails, Supervision Succeeds
Atij Mahesh

TL;DR
This paper compares six bias mitigation techniques for large language models and finds that explicit supervision outperforms preference learning at enforcing compositional constraints while maintaining naturalness.
Contribution
It provides a comprehensive analysis of six control methods, highlighting the limitations of preference learning and emphasizing the effectiveness of supervised fine-tuning for bias control.
Findings
Supervised fine-tuning achieves near-perfect constraint compliance.
Preference learning fails to enforce compositional constraints.
Explicit supervision maintains fluency and diversity.
Abstract
Large Language Models (LLMs) still produce gender-stereotyped language, even in occupation-neutral contexts, reflecting deep societal biases (Rudinger et al., 2018). To address this, prior work has proposed prompting, constrained decoding (Dathathri et al., 2020; Zhou et al., 2024), post-processing, and fine-tuning-based alignment (Rafailov et al., 2023; Ravfogel et al., 2022). However, the comparative efficacy and learning dynamics of these approaches remain poorly understood. We report a comparative analysis of six control techniques for bias mitigation: prompt-only, generate-and-filter, DFA-based Ctrl-G decoding, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Iterative Nullspace Projection (INLP). We evaluate each method on a compositional constraint task that requires generating sentences containing at least one agentic and one communal descriptor for each of the twenty…
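The compositional constraint described in the abstract (at least one agentic and one communal descriptor per sentence) can be illustrated with a minimal sketch. The descriptor lexicons, function names, and the `generate` callable below are assumptions made for illustration; the paper's actual word lists, tokenization, and generation setup are not given in this summary.

```python
import re
from typing import Callable, Iterable, Optional, Set

# Hypothetical descriptor lexicons, for illustration only;
# the paper's actual word lists are not shown in this summary.
AGENTIC: Set[str] = {"ambitious", "decisive", "assertive", "confident"}
COMMUNAL: Set[str] = {"caring", "supportive", "warm", "collaborative"}


def tokenize(sentence: str) -> Set[str]:
    """Lowercase the sentence and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def satisfies_constraint(sentence: str) -> bool:
    """True if the sentence has at least one agentic AND one communal descriptor."""
    words = tokenize(sentence)
    return bool(words & AGENTIC) and bool(words & COMMUNAL)


def compliance_rate(sentences: Iterable[str]) -> float:
    """Fraction of sentences that satisfy the compositional constraint."""
    items = list(sentences)
    if not items:
        return 0.0
    return sum(satisfies_constraint(s) for s in items) / len(items)


def generate_and_filter(generate: Callable[[], str], max_tries: int = 10) -> Optional[str]:
    """Sample from a (hypothetical) generator until a candidate passes the check.

    Returns None if no candidate passes within max_tries attempts.
    """
    for _ in range(max_tries):
        candidate = generate()
        if satisfies_constraint(candidate):
            return candidate
    return None
```

In this sketch, `satisfies_constraint` plays the role of the compliance check, and `generate_and_filter` mirrors the generate-and-filter baseline in its simplest form: rejection sampling against that check. A real evaluation would likely also need to handle inflected forms and multi-word descriptors, which plain token matching does not.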
