Curriculum-RLAIF: Curriculum Alignment with Reinforcement Learning from AI Feedback
Jiaye Lin, Mengdi Li, Xufeng Zhao, Wenhao Lu, Peilin Zhao, Stefan Wermter, Di Wang

TL;DR
Curriculum-RLAIF introduces a data-centric curriculum approach to improve reward model generalizability in reinforcement learning from AI feedback, leading to better policy alignment without extra inference costs.
Contribution
The paper proposes a curriculum framework that constructs preference pairs of varying difficulty and arranges them into a curriculum for reward model training, with the goal of improving reward model generalizability.
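As a rough illustration of the curriculum idea, the sketch below orders preference pairs from easy to hard and splits them into training stages. It is a minimal sketch under stated assumptions, not the authors' implementation: the `PreferencePair` type, its `difficulty` score, and the `build_curriculum` helper are hypothetical names introduced here, and this summary does not say how the paper actually measures difficulty or schedules the stages.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PreferencePair:
    """One AI-labeled preference pair (hypothetical structure)."""
    prompt: str
    chosen: str
    rejected: str
    difficulty: float  # assumed score; higher means harder to judge


def build_curriculum(pairs: List[PreferencePair], num_stages: int) -> List[List[PreferencePair]]:
    """Sort pairs by ascending difficulty and split them into contiguous stages.

    Stage sizes differ by at most one; earlier stages hold the easier pairs.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")

    ordered = sorted(pairs, key=lambda p: p.difficulty)
    base, extra = divmod(len(ordered), num_stages)

    stages: List[List[PreferencePair]] = []
    start = 0
    for i in range(num_stages):
        # The first `extra` stages each take one additional pair.
        size = base + (1 if i < extra else 0)
        stages.append(ordered[start:start + size])
        start += size
    return stages
```

A reward model would then be trained on the stages in order, for example by iterating over `build_curriculum(pairs, num_stages=3)` and running one training pass per stage. Whether stages are kept disjoint or accumulated is a design choice this sketch leaves open.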
Findings
Reward models trained with Curriculum-RLAIF outperform non-curriculum baselines.
The approach improves policy alignment performance significantly.
Curriculum-RLAIF is simpler, more efficient, and more effective than alternative strategies.
Abstract
Reward models trained through Reinforcement Learning from AI Feedback (RLAIF) methods frequently suffer from limited generalizability, which hinders the alignment performance of policy models. This challenge stems from several issues, including distribution shift, preference label noise, and a mismatch between overly challenging samples and model capacity. In this paper, we aim to enhance the generalizability of reward models through a data-centric approach, driven by the insight that these issues are inherently intertwined when viewed from the unified perspective of data difficulty. Accordingly, we propose a novel framework, Curriculum-RLAIF, which constructs preference pairs with varying difficulty levels and then produces a specific curriculum for reward model training. Comprehensive experimental results suggest that reward models trained with Curriculum-RLAIF achieve improved generalizability, boosting…
