Self-Generated Critiques Boost Reward Modeling for Language Models

Yue Yu; Zhengxing Chen; Aston Zhang; Liang Tan; Chenguang Zhu; Richard Yuanzhe Pang; Yundi Qian; Xuewei Wang; Suchin Gururangan; Chao Zhang; Melanie Kambadur; Dhruv Mahajan; Rui Hou

arXiv:2411.16646 · cs.CL · February 11, 2025

TL;DR

This paper introduces Critic-RM, a framework that trains reward models on self-generated critiques alongside scalar rewards, improving reward modeling accuracy by 3.7%-7.3% over standard reward models and LLM judges.

Contribution

Critic-RM is the first approach to incorporate self-generated critiques into reward modeling without extra supervision, improving reward modeling accuracy and helping rectify flawed reasoning steps.

Findings

01. Reward modeling accuracy improves by 3.7%-7.3% over standard reward models and LLM judges.

02. Generated critiques help rectify flawed reasoning steps.

03. Critic-RM demonstrates strong performance and data efficiency.

Abstract

Reward modeling is crucial for aligning large language models (LLMs) with human preferences, especially in reinforcement learning from human feedback (RLHF). However, current reward models mainly produce scalar scores and struggle to incorporate critiques in a natural language format. We hypothesize that predicting both critiques and the scalar reward would improve reward modeling ability. Motivated by this, we propose Critic-RM, a framework that improves reward models using self-generated critiques without extra supervision. Critic-RM employs a two-stage process: generating and filtering high-quality critiques, followed by joint fine-tuning on reward prediction and critique generation. Experiments across benchmarks show that Critic-RM improves reward modeling accuracy by 3.7%-7.3% compared to standard reward models and LLM judges, demonstrating strong performance and data efficiency.…
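The two-stage process described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the filtering criterion or the loss forms, so the sketch assumes that critiques are kept when their verdict agrees with the human preference label, that reward prediction uses a Bradley-Terry pairwise loss, and that the two objectives are combined with a fixed weight. The names `Critique`, `predicted_winner`, `filter_critiques`, and `critique_weight` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Critique:
    text: str
    predicted_winner: str  # hypothetical field: "a" or "b", the response the critique favors


def filter_critiques(critiques, human_preferred):
    """Stage 1 (assumed criterion): keep critiques whose verdict matches the human label."""
    return [c for c in critiques if c.predicted_winner == human_preferred]


def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    # log(1 + exp(-margin)) is identical to -log(sigmoid(margin)).
    # Fine for a sketch; very negative margins would overflow math.exp.
    return math.log1p(math.exp(-margin))


def critique_loss(token_log_probs):
    """Mean negative log-likelihood of the critique tokens."""
    return -sum(token_log_probs) / len(token_log_probs)


def joint_loss(reward_chosen, reward_rejected, token_log_probs, critique_weight=0.5):
    """Stage 2 (assumed form): reward loss plus a weighted critique-generation loss."""
    return (preference_loss(reward_chosen, reward_rejected)
            + critique_weight * critique_loss(token_log_probs))


if __name__ == "__main__":
    candidates = [
        Critique("Response A cites the correct formula.", "a"),
        Critique("Response B is more fluent.", "b"),
    ]
    kept = filter_critiques(candidates, human_preferred="a")
    print(len(kept), "critique(s) kept")
    print(joint_loss(1.2, 0.3, [-0.4, -0.9, -0.2]))
```

In a real training loop the scalar rewards and token log-probabilities would come from the model being fine-tuned; plain floats are used here only so the arithmetic of the combined objective is easy to follow.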

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques