Provably Robust DPO: Aligning Language Models with Noisy Feedback
Sayak Ray Chowdhury, Anush Kini, Nagarajan Natarajan

TL;DR
This paper introduces a theoretically grounded robust direct preference optimization (rDPO) method that mitigates the impact of noisy preference data when aligning language models with human interests, backed by formal guarantees and empirical validation.
Contribution
It proposes a novel loss function for policy optimization that is robust to noisy preferences and provides theoretical bounds on its sub-optimality gap under certain assumptions.
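To make the idea of a noise-robust loss concrete, here is a minimal sketch of one standard way to debias a pairwise loss when labels are flipped independently with a known rate. This summary does not state the paper's exact loss, so the form below is an assumption, as are the names `implicit_reward_margin`, `dpo_loss`, `debiased_loss`, `expected_noisy_loss`, the flip rate `eps`, and the temperature `beta`.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def implicit_reward_margin(beta: float,
                           logp_w: float, logp_ref_w: float,
                           logp_l: float, logp_ref_l: float) -> float:
    """DPO-style margin between the response labeled preferred (w) and the
    one labeled rejected (l), from policy and reference log-probabilities."""
    return beta * ((logp_w - logp_ref_w) - (logp_l - logp_ref_l))


def dpo_loss(margin: float) -> float:
    """Standard DPO loss on a single pair: -log sigmoid(margin)."""
    return -log_sigmoid(margin)


def debiased_loss(margin: float, eps: float) -> float:
    """Assumed robust loss: combine the loss on the pair as labeled with the
    loss on the swapped pair so that its expectation under random flips
    equals the clean DPO loss. Requires a known flip rate eps in [0, 0.5)."""
    if not 0.0 <= eps < 0.5:
        raise ValueError("eps must satisfy 0 <= eps < 0.5")
    as_labeled = dpo_loss(margin)
    as_swapped = dpo_loss(-margin)
    return ((1.0 - eps) * as_labeled - eps * as_swapped) / (1.0 - 2.0 * eps)


def expected_noisy_loss(clean_margin: float, eps: float) -> float:
    """Expectation of debiased_loss over the flip: with probability eps the
    observed pair is swapped, which negates the margin."""
    return ((1.0 - eps) * debiased_loss(clean_margin, eps)
            + eps * debiased_loss(-clean_margin, eps))
```

With `eps = 0` this reduces to the ordinary DPO loss. As `eps` approaches 1/2 the `1 / (1 - 2 * eps)` factor grows without bound, which is the sense in which this kind of correction becomes harder at higher noise levels.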
Findings
rDPO outperforms vanilla DPO in noisy settings
Theoretical sub-optimality gap grows with noise level and shrinks with data size
Empirical results confirm robustness to preference label noise
Abstract
Learning from preference-based feedback has recently gained traction as a promising approach to align language models with human interests. While these aligned generative models have demonstrated impressive capabilities across various tasks, their dependence on high-quality human preference data poses a bottleneck in practical applications. Specifically, noisy (incorrect and ambiguous) preference pairs in the dataset might restrict the language models from capturing human intent accurately. While practitioners have recently proposed heuristics to mitigate the effect of noisy preferences, a complete theoretical understanding of their workings remains elusive. In this work, we aim to bridge this gap by introducing a general framework for policy optimization in the presence of random preference flips. We focus on the direct preference optimization (DPO) algorithm in particular since it…
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Speech and dialogue systems
Methods: Direct Preference Optimization · Focus · Shrink and Fine-Tune · FLIP · ALIGN
