Boosting Reward Model with Preference-Conditional Multi-Aspect Synthetic Data Generation
Jiaming Shen, Ran Xu, Yennie Jun, Zhen Qin, Tianqi Liu, Carl Yang, Yi Liang, Simon Baumgartner, Michael Bendersky

TL;DR
This paper introduces RMBoost, a synthetic data generation method that improves reward models by creating more diverse and intentionally constructed preference pairs, reducing noise and enhancing alignment with human preferences.
Contribution
RMBoost is a novel preference data generation paradigm that conditions response generation on pre-selected preferences, improving reward model training and performance.
Findings
RMBoost outperforms existing synthetic data methods in experiments.
It significantly enhances reward model accuracy across multiple datasets.
The approach reduces noise and increases response diversity.
Abstract
Reward models (RMs) are crucial for aligning large language models (LLMs) with human preferences. They are trained using preference datasets where each example consists of one input prompt, two responses, and a preference label. As curating a high-quality human-labeled preference dataset is both time-consuming and expensive, people often rely on existing powerful LLMs for preference label generation. This can potentially introduce noise and impede RM training. In this work, we present RMBoost, a novel synthetic preference data generation paradigm to boost reward model quality. Unlike traditional methods, which generate two responses before obtaining the preference label, RMBoost first generates one response and selects a preference label, followed by generating the second more (or less) preferred response conditioned on the pre-selected preference label and the first response. This…
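The generation order described in the abstract (first response, then a pre-selected label, then a second response conditioned on both) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `PreferencePair` class, the `make_pair` function, the prompt wording, and the 50/50 label selection are all assumptions made for the example, and `llm` stands for any hypothetical text-in, text-out callable.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PreferencePair:
    """One preference example: a prompt, two responses, and a label."""
    prompt: str
    response_a: str
    response_b: str
    preferred: str  # "a" or "b"


def make_pair(
    prompt: str,
    llm: Callable[[str], str],  # hypothetical text-in, text-out model call
    rng: Optional[random.Random] = None,
) -> PreferencePair:
    rng = rng or random.Random()

    # Step 1: generate the first response from the prompt alone.
    first = llm(f"Respond to the following prompt.\n\nPrompt: {prompt}")

    # Step 2: pick the preference label before the second response exists.
    # (Uniform random selection is an assumption of this sketch.)
    second_is_better = rng.random() < 0.5
    direction = "more" if second_is_better else "less"

    # Step 3: generate the second response conditioned on the label
    # and on the first response. The instruction text is illustrative.
    second = llm(
        f"Prompt: {prompt}\n\n"
        f"Existing response: {first}\n\n"
        f"Write a new response that is {direction} preferred "
        f"than the existing one."
    )

    return PreferencePair(
        prompt=prompt,
        response_a=first,
        response_b=second,
        preferred="b" if second_is_better else "a",
    )
```

Because the label is fixed before the second response is written, no separate labeling pass over a finished pair is needed in this sketch; the pair is labeled by construction. Passing a seeded `random.Random` makes the label choice reproducible.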
