A Statistical Framework for Alignment with Biased AI Feedback

Xintao Xia; Zhiqiu Xia; Linjun Zhang; Zhanrui Cai

arXiv:2602.08259·stat.ML·February 10, 2026

TL;DR

This paper introduces two debiased alignment methods for large language models, DDPO and DIPO, that correct for systematic bias in AI-generated preference labels and bring the aligned model closer to human preferences.

Contribution

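The paper develops Debiased Direct Preference Optimization (DDPO) and Debiased Identity Preference Optimization (DIPO), two methods that correct bias in AI feedback, with theoretical guarantees for both and reported practical improvements over existing approaches.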

Findings

01. Both methods significantly improve alignment efficiency.

02. They recover performance close to that obtained with fully human-labeled data.

03. Empirical results span sentiment, summarization, and dialogue tasks.

Abstract

Modern alignment pipelines are increasingly replacing expensive human preference labels with evaluations from large language models (LLM-as-Judge). However, AI labels can be systematically biased compared to high-quality human feedback datasets. In this paper, we develop two debiased alignment methods within a general framework that accommodates heterogeneous prompt-response distributions and external human feedback sources. Debiased Direct Preference Optimization (DDPO) augments standard DPO with a residual-based correction and density-ratio reweighting to mitigate systematic bias, while retaining DPO's computational efficiency. Debiased Identity Preference Optimization (DIPO) directly estimates human preference probabilities without imposing a parametric reward model. We provide theoretical guarantees for both methods: DDPO offers a practical and computationally efficient solution for…
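
As a rough illustration of the "residual-based correction and density-ratio reweighting" idea mentioned in the abstract, the sketch below combines the standard DPO loss on a large AI-labeled set with a weighted correction term computed on a small set that carries both AI and human labels. This is a generic residual-correction construction written under stated assumptions, not the paper's DDPO estimator: the function names, the `(margin, label)` data layout, and the form of the weights are all hypothetical, and the density-ratio weights are assumed to be supplied from elsewhere.

```python
import math


def log_sigmoid(t):
    # Numerically stable log(sigmoid(t)).
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))


def dpo_loss(margin, label, beta=0.1):
    """Standard DPO loss for one response pair (a, b).

    margin: (log pi(a|x) - log pi_ref(a|x)) - (log pi(b|x) - log pi_ref(b|x))
    label:  1 if response a is preferred, 0 if response b is preferred.
    """
    return -(label * log_sigmoid(beta * margin)
             + (1 - label) * log_sigmoid(-beta * margin))


def debiased_loss(ai_pairs, paired, beta=0.1):
    """Illustrative residual-corrected objective (an assumption, not the paper's).

    ai_pairs: list of (margin, ai_label) from the large AI-labeled set.
    paired:   list of (margin, ai_label, human_label, weight) from the small
              set with both labels; weight is a hypothetical density-ratio
              weight that accounts for a shift in prompt-response distribution.
    """
    # Average DPO loss using AI labels only (cheap, possibly biased).
    ai_term = sum(dpo_loss(m, z, beta) for m, z in ai_pairs) / len(ai_pairs)
    # Weighted residual: how much the AI-label loss misstates the human-label loss.
    residual = sum(
        w * (dpo_loss(m, h, beta) - dpo_loss(m, z, beta))
        for m, z, h, w in paired
    ) / len(paired)
    return ai_term + residual


ai_pairs = [(1.0, 1), (-2.0, 0)]
paired = [(2.0, 1, 0, 1.0)]  # AI judge and human disagree on this pair
print(debiased_loss(ai_pairs, paired))
```

When the AI and human labels agree on every paired example, the residual term is zero and the objective reduces to the plain DPO loss on AI labels; disagreement shifts the objective toward the human labels.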

Taxonomy

Topics: Sentiment Analysis and Opinion Mining · Topic Modeling · Speech and dialogue systems