Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models

Ilgee Hong; Changlong Yu; Liang Qiu; Weixiang Yan; Zhenghao Xu; Haoming Jiang; Qingru Zhang; Qin Lu; Xin Liu; Chao Zhang; Tuo Zhao

arXiv:2505.16265·cs.LG·May 23, 2025

TL;DR

Think-RM is a training framework for generative reward models (GenRMs) that enables long-horizon reasoning, so the reward model can handle nuanced, reasoning-intensive tasks in reinforcement learning from human feedback (RLHF).

Contribution

The paper presents an approach for training GenRMs with an internal reasoning process, together with a pairwise RLHF pipeline, to improve performance on complex tasks and address limitations of existing reward models.

Findings

01. Achieves state-of-the-art results on RM-Bench with 8% improvement.

02. Outperforms traditional BT RMs and scaled GenRMs in complex reasoning tasks.

03. Demonstrates superior end-policy performance with pairwise RLHF.

Abstract

Reinforcement learning from human feedback (RLHF) has become a powerful post-training paradigm for aligning large language models with human preferences. A core challenge in RLHF is constructing accurate reward signals, where the conventional Bradley-Terry reward models (BT RMs) often suffer from sensitivity to data size and coverage, as well as vulnerability to reward hacking. Generative reward models (GenRMs) offer a more robust alternative by generating chain-of-thought (CoT) rationales followed by a final reward. However, existing GenRMs rely on shallow, vertically scaled reasoning, limiting their capacity to handle nuanced or complex (e.g., reasoning-intensive) tasks. Moreover, their pairwise preference outputs are incompatible with standard RLHF algorithms that require pointwise reward signals. In this work, we introduce Think-RM, a training framework that enables long-horizon…
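
For context, a Bradley-Terry reward model assigns each response a scalar score and models the probability that one response is preferred as a sigmoid of the score difference. The sketch below illustrates that standard formulation only; it is not taken from the Think-RM repository, and the function names `bt_loss` and `bt_preference_prob` are hypothetical.

```python
import math


def bt_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood of the preferred response under Bradley-Terry.

    Equals -log(sigmoid(score_chosen - score_rejected)), computed in a
    numerically stable way as softplus(-(score_chosen - score_rejected)).
    """
    margin = score_chosen - score_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def bt_preference_prob(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry probability that the first response is preferred."""
    return math.exp(-bt_loss(score_chosen, score_rejected))


if __name__ == "__main__":
    # Hypothetical scalar scores, as a pointwise reward model would emit them.
    print(bt_preference_prob(1.5, 0.5))
    print(bt_loss(1.5, 0.5))
```

The scores here are pointwise: each response gets its own scalar, which is the form of reward signal that the abstract says standard RLHF algorithms require. A GenRM that only outputs which of two responses is better does not directly provide such scalars.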

Code & Models

Repositories

ilgeehong/think-rm (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications