Generative Reward Models

Dakota Mahan; Duy Van Phung; Rafael Rafailov; Chase Blagden; Nathan Lile; Louis Castricato; Jan-Philipp Fränken; Chelsea Finn; Alon Albalak

arXiv:2410.12832·cs.LG·October 18, 2024

Open Access 3 Reviews

TL;DR

This paper introduces GenRM, a hybrid algorithm that improves synthetic preference labels for training large language models by combining RLHF and RLAIF, leading to better alignment with human judgments especially out-of-distribution.

Contribution

We propose GenRM, an iterative method that trains LLMs on self-generated reasoning traces to produce synthetic preferences matching human judgments, bridging the gap between RLHF and RLAIF.

Findings

01

GenRM achieves in-distribution accuracy comparable to Bradley-Terry models.

02

GenRM significantly outperforms Bradley-Terry models on out-of-distribution tasks.

03

GenRM surpasses LLMs used as judges on both in- and out-of-distribution tasks.
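
For context on the Bradley-Terry baseline named in the findings: a Bradley-Terry reward model assigns each response a scalar reward and models the probability that one response is preferred over another as the sigmoid of the reward difference. The sketch below is a minimal illustration of that formula and of pairwise preference accuracy. The function names and the reward values are hypothetical and are not taken from the paper.

```python
import math


def bt_preference_prob(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that response A is preferred over response B."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def pairwise_accuracy(pairs) -> float:
    """Fraction of (chosen_reward, rejected_reward) pairs ranked correctly."""
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)


# Hypothetical scalar rewards, for illustration only (not results from the paper).
example_pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0), (0.5, 0.2)]

if __name__ == "__main__":
    print(bt_preference_prob(1.2, 0.3))
    print(pairwise_accuracy(example_pairs))
```

Accuracy here only checks whether the human-chosen response receives the higher reward, which is the kind of preference accuracy the findings compare across methods.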

Abstract

Reinforcement Learning from Human Feedback (RLHF) has greatly improved the performance of modern Large Language Models (LLMs). The RLHF process is resource-intensive and technically challenging, generally requiring a large collection of human preference labels over model-generated outputs. Reinforcement Learning from AI Feedback (RLAIF) addresses this data collection challenge by leveraging synthetic preferences generated by an LLM. However, recent work has shown that synthetic preference labels may not align well with human preference judgments. To address this, we propose a hybrid approach that unifies RLHF and RLAIF methodologies. We introduce GenRM, an iterative algorithm that trains an LLM on self-generated reasoning traces, leading to synthetic preference labels matching human preference judgments. Empirically, we show that zero-shot LLM-based judgments under-perform compared to…

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

* The paper successfully demonstrates that majority voting (self-consistency) at test time can substantially improve preference modeling accuracy. The results show consistent improvements across different datasets, providing a practical way to enhance preference modeling through additional compute at inference time.
* The evaluation for reward modeling covers both in-domain (UltraFeedback) and out-of-domain (RewardBench) datasets. The experimental results demonstrate robust performance improveme
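
The majority-voting (self-consistency) step mentioned in this review can be sketched as follows: sample several independent judgments for the same response pair and keep the most frequent verdict. This is a minimal sketch under stated assumptions. The labels "A" and "B", the function name `majority_vote`, and the tie-handling rule are hypothetical choices for illustration, not the paper's implementation.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(judgments: Sequence[str]) -> Optional[str]:
    """Return the most frequent verdict, or None on an empty input or a tie.

    `judgments` holds one verdict per sampled reasoning trace,
    e.g. "A" if the judge preferred response A, "B" otherwise.
    """
    if not judgments:
        return None
    ranked = Counter(judgments).most_common()
    # Assumption: a tie between the top two verdicts yields no decision.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


if __name__ == "__main__":
    # Hypothetical sampled verdicts, for illustration only.
    print(majority_vote(["A", "B", "A", "A", "B"]))
```

Each additional sampled judgment costs one more generation at inference time, which is the accuracy-versus-compute trade-off the reviewers discuss.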

Weaknesses

* The evaluation is limited in demonstrating what the preference accuracy improvements mean for actual policy performance. While the paper shows improvements in preference modeling accuracy, this metric heavily depends on the policy sample distribution. Notably absent is an analysis of Best-of-N (BoN) performance with the proposed reward model, which would be crucial for understanding practical impact.
* The practical implementation raises significant concerns. The requirement for 32 majority vo

Reviewer 02 · Rating 5 · Confidence 3

Strengths

The strength of this paper is its potential significance and novelty. The proposed direct formulation of the preference indicator looks interesting to me, which may allow more types of approaches and data sources to be combined to train stronger reward models, especially in real-world scenarios that require strong generalization ability.

Weaknesses

The clarity and quality of the paper can be significantly improved. For clarity, the experimental section could be restructured to highlight the answers to the questions posed in lines 296-302. The accuracy numbers behind Figure 2 and Figure 4 should also be included in the paper, at least in the Appendix. Concerning quality, it would be strongly recommended to have policy models trained on STaR-DPO to see whether this improvement in reward modeling

Reviewer 03 · Rating 3 · Confidence 4

Strengths

The paper presents an iterative framework called GenRM designed to train an LLM using self-generated reasoning traces. This framework is novel to some extent and has been validated on both in-distribution and out-of-distribution tasks through various ablation settings. It demonstrates scalability and good efficiency. The experimental results indicate that integrating chain-of-thought reasoning within preference modeling enhances the model's reasoning ability, contributing valuable insights for f

Weaknesses

1. The evaluation of RM is inadequate, lacking experiments that assess agreement with human preferences. Additionally, only one out-of-distribution dataset, RewardBench, is utilized for evaluation.
2. The qualitative results presented in Figure 3 are unconvincing. Both STaR-DPO and LLM-as-a-judge use the same system prompt, as noted in Appendix A.1, which states that evaluations should consider factors such as helpfulness, relevance, accuracy, depth, creativity, and detail. However, your explana

Taxonomy

Topics: Diverse Scientific and Economic Studies · Diverse Specialized Academic Research

Methods: ALIGN · Reinforcement Learning from AI Feedback