Boosting Reward Model with Preference-Conditional Multi-Aspect Synthetic Data Generation

Jiaming Shen; Ran Xu; Yennie Jun; Zhen Qin; Tianqi Liu; Carl Yang; Yi Liang; Simon Baumgartner; Michael Bendersky

arXiv:2407.16008·cs.CL·March 18, 2025

TL;DR

This paper introduces RMBoost, a synthetic data generation method that improves reward models by creating more diverse and intentionally constructed preference pairs, reducing noise and enhancing alignment with human preferences.

Contribution

RMBoost is a novel preference data generation paradigm that conditions response generation on pre-selected preferences, improving reward model training and performance.

Findings

01. RMBoost outperforms existing synthetic data methods in experiments.

02. It significantly enhances reward model accuracy across multiple datasets.

03. The approach reduces noise and increases response diversity.

Abstract

Reward models (RMs) are crucial for aligning large language models (LLMs) with human preferences. They are trained using preference datasets where each example consists of one input prompt, two responses, and a preference label. As curating a high-quality human-labeled preference dataset is both time-consuming and expensive, people often rely on existing powerful LLMs for preference label generation. This can potentially introduce noise and impede RM training. In this work, we present RMBoost, a novel synthetic preference data generation paradigm to boost reward model quality. Unlike traditional methods, which generate two responses before obtaining the preference label, RMBoost first generates one response and selects a preference label, followed by generating the second more (or less) preferred response conditioned on the pre-selected preference label and the first response. This…
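The generation order described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `PreferenceExample` record, the `rmboost_pair` function, the prompt wording, and the `stub_llm` stand-in are all hypothetical names introduced here, and a real pipeline would replace the stub with an actual LLM call. The sketch only shows the ordering: first response, then label, then a second response conditioned on both.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class PreferenceExample:
    """One preference record: a prompt, two responses, and a label."""
    prompt: str
    response_1: str
    response_2: str
    preferred: int  # 1 if response_1 is preferred, 2 if response_2 is


def rmboost_pair(
    prompt: str,
    generate: Callable[[str], str],  # hypothetical LLM interface: text in, text out
    rng: random.Random,
) -> PreferenceExample:
    # Step 1: generate the first response from the prompt alone.
    response_1 = generate(f"Respond to the following request.\n\nRequest: {prompt}")

    # Step 2: select the preference label before the second response exists.
    preferred = rng.choice([1, 2])
    direction = "better" if preferred == 2 else "worse"

    # Step 3: generate the second response conditioned on the pre-selected
    # label and the first response (illustrative prompt wording, an assumption).
    response_2 = generate(
        f"Request: {prompt}\n\nExisting response: {response_1}\n\n"
        f"Write a new response that is clearly {direction} than the existing one."
    )
    return PreferenceExample(prompt, response_1, response_2, preferred)


def stub_llm(instruction: str) -> str:
    """Stand-in for a real LLM call; reports which kind of output was requested."""
    if "clearly better" in instruction:
        return "<improved response>"
    if "clearly worse" in instruction:
        return "<degraded response>"
    return "<first response>"


example = rmboost_pair("Summarize the plot of Hamlet.", stub_llm, random.Random(0))
print(example)
```

Because the label is fixed before the second response is written, it is known by construction and does not have to be inferred afterwards by a separate LLM judge, which is the source of label noise the abstract points to.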
