GFRIEND: Generative Few-shot Reward Inference through EfficieNt DPO
Yiyang Zhao, Huiyu Bai, Xuejiao Zhao

TL;DR
This paper introduces GFRIEND, a framework for training generative reward models for RLHF from few-shot data. It combines data augmentation, preference refinement, and multi-level preference optimization so that models trained on small datasets perform comparably to those trained on large-scale data.
Contribution
The paper presents a new data augmentation and preference refinement framework that enhances reward model training with few-shot data, outperforming traditional methods like DPO.
Findings
Significant improvement in data efficiency for reward models.
Reward models trained with GFRIEND on small datasets achieve performance comparable to models trained on large-scale datasets.
Enhanced preference understanding through Chain-of-Thought sampling and multi-level optimization.
Abstract
The ability to train high-performing reward models with few-shot data is critical for enhancing the efficiency and scalability of Reinforcement Learning from Human Feedback (RLHF). We propose a data augmentation and expansion framework that enables generative reward models trained on small datasets to achieve comparable performance to those trained on large-scale datasets. Traditional methods to train a generative reward model, such as Direct Preference Optimization (DPO), are constrained by inefficiencies in sample pairing and limited data diversity. This work introduces preference refinement, which employs Chain-of-Thought (CoT) sampling to uncover diverse and high-quality preference relationships. It also incorporates a perplexity-based scoring mechanism to assign nuanced preference levels and utilizes Multi-level Direct Preference Optimization (M-DPO) to enable the model to capture…
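The two building blocks the abstract names, perplexity and DPO, can be illustrated with a minimal sketch. This is not the paper's method: the abstract is truncated before it defines the perplexity-based scoring rule or the M-DPO objective, so the code below shows only the standard definition of perplexity from per-token log-probabilities and the standard single-pair DPO loss. The function names, argument names, and the default `beta=0.1` are assumptions chosen for illustration.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence given per-token natural-log probabilities.

    Standard definition: exp of the negative mean token log-probability.
    How GFRIEND turns this into preference levels is not specified in the
    abstract, so no mapping is attempted here.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Standard DPO loss for one preference pair (not M-DPO).

    Inputs are sequence-level log-probabilities of the chosen and rejected
    responses under the policy and under a frozen reference model.
    beta=0.1 is an illustrative default, not a value from the paper.
    """
    margin = beta * ((policy_chosen_lp - ref_chosen_lp)
                     - (policy_rejected_lp - ref_rejected_lp))
    # -log(sigmoid(margin)) computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2; the loss falls as the policy raises the chosen response's log-probability relative to the rejected one.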
Taxonomy
Topics: Machine Learning and Data Classification · Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI)
