A Statistical Framework for Alignment with Biased AI Feedback
Xintao Xia, Zhiqiu Xia, Linjun Zhang, Zhanrui Cai

TL;DR
This paper introduces two debiased alignment methods for large language models that mitigate systematic bias in AI-generated feedback, bringing alignment closer to human preferences.
Contribution
It develops Debiased Direct Preference Optimization (DDPO) and Debiased Identity Preference Optimization (DIPO), two methods that correct for bias in AI feedback, with theoretical guarantees and practical improvements over existing approaches.
Findings
Both methods significantly improve alignment efficiency.
They recover performance close to that obtained with fully human-labeled data.
Empirical results span sentiment, summarization, and dialogue tasks.
Abstract
Modern alignment pipelines are increasingly replacing expensive human preference labels with evaluations from large language models (LLM-as-Judge). However, AI labels can be systematically biased compared to high-quality human feedback datasets. In this paper, we develop two debiased alignment methods within a general framework that accommodates heterogeneous prompt-response distributions and external human feedback sources. Debiased Direct Preference Optimization (DDPO) augments standard DPO with a residual-based correction and density-ratio reweighting to mitigate systematic bias, while retaining DPO's computational efficiency. Debiased Identity Preference Optimization (DIPO) directly estimates human preference probabilities without imposing a parametric reward model. We provide theoretical guarantees for both methods: DDPO offers a practical and computationally efficient solution for…
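To make the idea of a residual-based correction with density-ratio reweighting concrete, here is a minimal sketch in Python. It is not the paper's DDPO objective, which the truncated abstract does not spell out. It assumes a generic construction: a standard DPO-style loss computed on AI-labeled pairs, plus a reweighted correction term computed on a smaller set of pairs that carry both a human label and an AI label. The function names, the `weights` argument (a stand-in for estimated density ratios), and the value of `beta` are all illustrative assumptions.

```python
import numpy as np


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -np.logaddexp(0.0, -x)


def dpo_loss(margin, label, beta=0.1):
    """Per-example DPO-style loss.

    margin: implicit reward difference between response A and response B
            (policy-vs-reference log-ratio of A minus that of B).
    label:  1.0 if A is preferred, 0.0 if B is preferred.
    """
    margin = np.asarray(margin, dtype=float)
    label = np.asarray(label, dtype=float)
    return -(label * log_sigmoid(beta * margin)
             + (1.0 - label) * log_sigmoid(-beta * margin))


def debiased_loss(margin_ai, label_ai,
                  margin_h, label_h_human, label_h_ai,
                  weights, beta=0.1):
    """Hypothetical debiased objective (an assumption, not the paper's formula).

    plug-in term: average loss on the large AI-labeled set.
    correction:   reweighted average of (human-label loss - AI-label loss)
                  on the small set that has both kinds of label.
    weights:      assumed density ratios between the two prompt-response
                  distributions; all ones means no distribution shift.
    """
    plug_in = dpo_loss(margin_ai, label_ai, beta).mean()
    residual = (dpo_loss(margin_h, label_h_human, beta)
                - dpo_loss(margin_h, label_h_ai, beta))
    correction = (np.asarray(weights, dtype=float) * residual).mean()
    return plug_in + correction


# Toy data: the AI judge disagrees with the human on the second pair.
margins = np.array([1.0, -2.0, 0.5])
ai_labels = np.array([1.0, 1.0, 0.0])
human_labels = np.array([1.0, 0.0, 0.0])
unit_weights = np.ones(3)

corrected = debiased_loss(margins, ai_labels,
                          margins, human_labels, ai_labels,
                          unit_weights)
human_only = dpo_loss(margins, human_labels).mean()
```

In this toy case the same three pairs serve as both the AI-labeled set and the doubly labeled set, and the weights are all one, so the AI-label terms cancel and `corrected` equals `human_only`. With a large AI-labeled set and a small human-labeled set drawn from a different distribution, the weights would need to be estimated, which this sketch leaves out.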
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Topic Modeling · Speech and dialogue systems
