Mitigating LLM biases toward spurious social contexts using direct preference optimization
Hyunji Nam, Dorottya Demszky

TL;DR
This paper introduces Debiasing-DPO, a self-supervised training method that significantly reduces biases caused by irrelevant social contexts in large language models, improving their robustness and accuracy.
Contribution
The paper proposes Debiasing-DPO, a novel self-supervised training approach that mitigates social context biases in LLMs while maintaining predictive accuracy.
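For reference, the generic direct preference optimization objective that the method's name points to can be written as follows. This is the standard DPO loss, not necessarily the exact variant used in the paper; this summary does not say how Debiasing-DPO builds its preference pairs, so the roles of $y_w$ (preferred response) and $y_l$ (dispreferred response) are left abstract here. $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ is a frozen reference model, $\beta$ is a temperature hyperparameter, and $\sigma$ is the logistic function.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```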
Findings
Debiasing-DPO reduces social context bias by 84%.
Applying Debiasing-DPO improves predictive accuracy by 52%.
Without mitigation, larger models can be more sensitive to spurious contexts than smaller ones.
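One way to make figures like these concrete is to compare a model's scores with and without the spurious context and then express the change as a relative reduction. The sketch below is illustrative only: the function names, the use of mean absolute shift as the bias measure, and the example scores are all assumptions, not the paper's definitions or data.

```python
from statistics import mean


def context_shifts(baseline, with_context):
    """Per-item signed shift: score with spurious context minus baseline score."""
    if len(baseline) != len(with_context):
        raise ValueError("score lists must have the same length")
    return [c - b for b, c in zip(baseline, with_context)]


def mean_abs_shift(baseline, with_context):
    """Average magnitude of the shift (a hypothetical bias measure)."""
    return mean(abs(s) for s in context_shifts(baseline, with_context))


def relative_reduction(before, after):
    """Fractional reduction of a measure, e.g. 0.84 for an 84% drop."""
    if before == 0:
        raise ValueError("cannot compute a reduction from a zero baseline")
    return (before - after) / before


# Made-up rubric scores for four transcripts (not from the paper).
baseline = [4.0, 3.0, 5.0, 2.0]          # no spurious context in the prompt
with_context = [5.0, 3.5, 5.0, 3.0]      # spurious context added, untreated model
after_debias = [4.0, 3.25, 5.0, 2.0]     # spurious context added, treated model

bias_before = mean_abs_shift(baseline, with_context)
bias_after = mean_abs_shift(baseline, after_debias)
reduction = relative_reduction(bias_before, bias_after)
```

The signed shifts from `context_shifts` keep the direction of the effect, which matters when a context inflates some scores and deflates others; taking absolute values before averaging prevents those from cancelling out.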
Abstract
LLMs are increasingly used for high-stakes decision-making, yet their sensitivity to spurious contextual information can introduce harmful biases. This is a critical concern when models are deployed for tasks like evaluating teachers' instructional quality, where biased assessment can affect teachers' professional development and career trajectories. We investigate model robustness to spurious social contexts using the largest publicly available dataset of U.S. classroom transcripts (NCTE) paired with expert rubric scores. Evaluating seven frontier and open-weight models across seven categories of spurious contexts -- including teacher experience, education level, demographic identity, and sycophancy-inducing framings -- we find that irrelevant contextual information can shift model predictions by up to 1.48 points on a 7-point scale, with larger models sometimes exhibiting greater…
