CE-RM: A Pointwise Generative Reward Model Optimized via Two-Stage Rollout and Unified Criteria

Xinyu Hu; Yancheng He; Weixun Wang; Tao Feng; Li Lin; Jiashun Liu; Wenbo Su; Bo Zheng; Xiaojun Wan

arXiv:2601.20327 · cs.CL · February 3, 2026


TL;DR

This paper introduces CE-RM-4B, a pointwise generative reward model trained with a two-stage rollout method and unified query-based criteria, which the authors report improves both automatic evaluation and reinforcement learning (RL) for open-ended language generation.

Contribution

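The paper presents a pointwise generative reward model trained with a two-stage rollout method and unified query-based criteria. It targets two limitations of prior work, the dominance of pairwise evaluation and inadequately optimized evaluation criteria, with the aim of making generative reward models more effective in RL practice.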

Findings

01. Outperforms existing reward models on diverse benchmarks.

02. Achieves superior results in Best-of-N selection.

03. Yields larger gains in downstream RL tasks.
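As a rough illustration of the Best-of-N setting in finding 02, the sketch below shows how a pointwise reward model is typically used: each candidate response is scored on its own against the query, and the highest-scoring candidate is kept. The `score_fn` callable is a hypothetical stand-in for a call to a judge such as CE-RM-4B; its prompt format and score scale are assumptions, not details taken from the paper.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface (an assumption, not the paper's API):
# score_fn(query, response) -> scalar score for that single response.
ScoreFn = Callable[[str, str], float]


def best_of_n(
    query: str, candidates: Sequence[str], score_fn: ScoreFn
) -> Tuple[int, str, float]:
    """Return (index, response, score) of the highest-scoring candidate.

    Pointwise scoring: each candidate is judged independently, so N
    candidates need exactly N judge calls. max() keeps the earliest
    index when scores tie.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    scores = [score_fn(query, c) for c in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return best, candidates[best], scores[best]


if __name__ == "__main__":
    # Toy scorer for demonstration only: prefers longer responses.
    def toy_score(query: str, response: str) -> float:
        return float(len(response))

    idx, resp, score = best_of_n(
        "Explain RL.", ["short", "a longer answer", "mid one"], toy_score
    )
    print(idx, resp, score)
```

In practice the toy scorer would be replaced by a call that prompts the reward model with the query, its criteria, and one response, then parses a score from the generated judgment.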

Abstract

Automatic evaluation is crucial yet challenging for open-ended natural language generation, especially when rule-based metrics are infeasible. Compared with traditional methods, the recent LLM-as-a-Judge paradigms enable better and more flexible evaluation, and show promise as generative reward models for reinforcement learning. However, prior work has revealed a notable gap between their seemingly impressive benchmark performance and actual effectiveness in RL practice. We attribute this issue to some limitations in existing studies, including the dominance of pairwise evaluation and inadequate optimization of evaluation criteria. Therefore, we propose CE-RM-4B, a pointwise generative reward model trained with a dedicated two-stage rollout method and adopting unified query-based criteria. Using only about 5.7K high-quality samples curated from the open-source preference dataset, our…
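A back-of-the-envelope comparison helps show why the pointwise/pairwise distinction matters for RL, where every rollout in a sampled group needs its own scalar reward. A pointwise judge returns that reward in one call per rollout; a pairwise judge only says which of two responses is better, so per-rollout rewards must be aggregated from many comparisons. The sketch below counts judge calls for both; the judge interfaces and the round-robin win-rate aggregation are illustrative assumptions, not the method used in the paper.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

# Hypothetical judge interfaces (assumptions, not the paper's API):
#   score_fn(query, response) -> float                 (pointwise)
#   prefer_fn(query, a, b) -> bool, True if a beats b   (pairwise)
ScoreFn = Callable[[str, str], float]
PreferFn = Callable[[str, str, str], bool]


def pointwise_rewards(
    query: str, responses: Sequence[str], score_fn: ScoreFn
) -> Tuple[List[float], int]:
    """One judge call per response; each call yields that response's reward."""
    rewards = [score_fn(query, r) for r in responses]
    return rewards, len(responses)


def pairwise_win_rates(
    query: str, responses: Sequence[str], prefer_fn: PreferFn
) -> Tuple[List[float], int]:
    """Round-robin comparisons; each response's reward is its win rate.

    Uses n * (n - 1) / 2 judge calls for n responses (one ordering per pair).
    """
    n = len(responses)
    if n < 2:
        return [0.0] * n, 0
    wins = [0] * n
    calls = 0
    for i, j in combinations(range(n), 2):
        calls += 1
        if prefer_fn(query, responses[i], responses[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Each response takes part in exactly n - 1 comparisons.
    return [w / (n - 1) for w in wins], calls


if __name__ == "__main__":
    group = ["a", "bbb", "cc", "dddd"]
    _, point_calls = pointwise_rewards("q", group, lambda q, r: float(len(r)))
    rates, pair_calls = pairwise_win_rates(
        "q", group, lambda q, a, b: len(a) > len(b)
    )
    print(point_calls, pair_calls, rates)
```

For a group of 8 rollouts this is 8 pointwise calls versus 28 pairwise calls, and for 16 rollouts 16 versus 120. Judging both orderings of each pair, a common guard against position bias, doubles the pairwise count.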


Code & Models

Models

🤗 PKU-ONELab/CE-RM-4B (Hugging Face model)


Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques