Filtered Direct Preference Optimization
Tetsuro Morimura, Mitsuki Sakamoto, Yuu Jinnai, Kenshi Abe, Kaito Ariu

TL;DR
This paper investigates how text quality in preference datasets affects RLHF, introduces filtered DPO (fDPO), which discards low-quality samples during training, and reports improved performance over standard DPO.
Contribution
The paper proposes fDPO, an extension of DPO that uses a trained reward model to filter low-quality samples out of the preference dataset during training, with the aim of making RLHF more effective.
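A minimal sketch of the filtering idea, assuming a reward model is available as a plain scoring function. The rule used here (keep a pair only if its chosen response scores at least a fixed cutoff) is an illustrative assumption, not the paper's criterion. `PreferencePair`, `filter_pairs`, `toy_reward`, and `min_reward` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # response labeled as preferred
    rejected: str  # response labeled as dispreferred


def filter_pairs(
    pairs: List[PreferencePair],
    reward_fn: Callable[[str, str], float],
    min_reward: float,
) -> List[PreferencePair]:
    """Keep pairs whose chosen response scores at least min_reward.

    Assumption: a fixed cutoff stands in for whatever quality test
    the method actually applies during training.
    """
    return [p for p in pairs if reward_fn(p.prompt, p.chosen) >= min_reward]


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a trained reward model: counts words."""
    return float(len(response.split()))


pairs = [
    PreferencePair("Q1", "a detailed and helpful answer", "no"),
    PreferencePair("Q2", "ok", "bad"),
]
kept = filter_pairs(pairs, toy_reward, min_reward=3.0)
print([p.prompt for p in kept])
```

In a training loop, a filter like this would be re-applied to the dataset between epochs, so that only the surviving pairs feed the DPO update.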
Findings
fDPO improves model performance over standard DPO.
Text quality in the preference dataset affects DPO-optimized models more than models optimized with reward-model-based RLHF.
Filtering low-quality samples during training yields a more accurate preference dataset.
Abstract
Reinforcement learning from human feedback (RLHF) plays a crucial role in aligning language models with human preferences. While the significance of dataset quality is generally recognized, explicit investigations into its impact within the RLHF framework, to our knowledge, have been limited. This paper addresses the issue of text quality within the preference dataset by focusing on direct preference optimization (DPO), an increasingly adopted reward-model-free RLHF method. We confirm that text quality significantly influences the performance of models optimized with DPO more than those optimized with reward-model-based RLHF. Building on this new insight, we propose an extension of DPO, termed filtered direct preference optimization (fDPO). fDPO uses a trained reward model to monitor the quality of texts within the preference dataset during DPO training. Samples of lower quality are…
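For reference, the standard per-pair DPO objective that fDPO builds on can be sketched in a few lines. This is the generic DPO loss, not code from the paper. The function name `dpo_loss`, its argument names, and the default `beta` are assumptions made for illustration. Inputs are summed log-probabilities of each response under the policy being trained and under the frozen reference model.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen margin - rejected margin))."""
    # Log-ratio of policy to reference for each response.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) = log(1 + exp(-z)), evaluated in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# When the policy equals the reference, the loss is log(2).
print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
```

The loss falls below log(2) as the policy raises the chosen response's log-probability relative to the reference more than it raises the rejected one's.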
Taxonomy
Topics: Constraint Satisfaction and Optimization
Methods: Direct Preference Optimization
