When Human Preferences Flip: An Instance-Dependent Robust Loss for RLHF
Yifan Xu, Xichen Ye, Yifan Chen, Qiaosheng Zhang

TL;DR
This paper proposes a robust preference optimization algorithm for RLHF that accounts for human preference flipping, improving LLM alignment by modeling instance-dependent flipping probabilities and uncertainty in human judgments.
Contribution
Introduces FA-DPO, a novel algorithm that explicitly models preference flipping in RLHF using an instance-dependent approach and integrates it into existing optimization frameworks.
Findings
FA-DPO outperforms baseline methods in robustness against preference flips.
Modeling preference uncertainty improves annotation reliability.
Experimental results validate the effectiveness of the instance-dependent flipping model.
Abstract
Dataset quality plays an important role in large language model (LLM) alignment. When collecting human feedback, however, preference flipping is ubiquitous and corrupts data annotation; this calls for alignment algorithms that are more robust to potentially flipped pairs. To this end, this paper introduces Flipping-Aware Direct Preference Optimization (FA-DPO), an algorithm tailored to preference flipping from a reinforcement learning from human feedback (RLHF) perspective. We separate the inherent human intention model and the preference flipping mechanism introduced by external factors into two distinct stages; in the latter, we introduce an instance-dependent flipping probability on the basis of the Bradley-Terry (BT) model. Further, by leveraging features relevant to preference annotation, we capture uncertainty in judgments and model preference flipping…
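The two-stage view described in the abstract can be illustrated with a small sketch. It is not the paper's FA-DPO loss, which is not given in the text above; it assumes the standard noisy-label construction in which a clean BT preference probability is mixed with a per-example flip probability. The names `margin` (the reward or implicit-reward difference between the preferred and rejected response) and `flip_prob` (the instance-dependent flipping probability) are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def observed_pref_prob(margin: float, flip_prob: float) -> float:
    """Probability that the *recorded* label prefers the first response.

    Stage 1 (intention): Bradley-Terry gives P(first > second) = sigmoid(margin).
    Stage 2 (flipping): the label is flipped with probability flip_prob,
    which may differ for every example (assumed form, not from the paper).
    """
    p_clean = sigmoid(margin)
    return (1.0 - flip_prob) * p_clean + flip_prob * (1.0 - p_clean)


def flip_aware_nll(margin: float, flip_prob: float) -> float:
    """Negative log-likelihood of the recorded label under the two-stage model."""
    return -math.log(observed_pref_prob(margin, flip_prob))


if __name__ == "__main__":
    # With no flipping the loss reduces to the usual BT / DPO-style log-sigmoid loss.
    print(flip_aware_nll(2.0, 0.0))
    # A higher flip probability makes the recorded label less informative.
    print(flip_aware_nll(2.0, 0.3))
```

Under this assumed form, setting `flip_prob` to 0 recovers the plain BT likelihood, while `flip_prob` of 0.5 makes every recorded label equally likely regardless of the margin, so such examples contribute no gradient with respect to the margin.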
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Data Quality and Management
