TL;DR
ReflectRM introduces a self-reflective generative reward model that jointly assesses response and analysis preferences, significantly improving alignment accuracy and reducing positional bias in large language model evaluation.
Contribution
The paper presents a unified generative reward modeling framework that uses self-reflection to improve preference assessment and model stability.
Findings
Achieves a +3.7 accuracy gain with Qwen3-4B as the base model.
Substantially reduces positional bias, with a reported improvement of 10.2 points.
Response and analysis preferences mutually reinforce each other.
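To make the positional-bias finding concrete, here is a minimal sketch of one common way to measure positional consistency for a pairwise judge: ask for a verdict, swap the order of the two responses, and check that the same underlying response still wins. This is an illustrative assumption, not the paper's evaluation code; the `judge` signature, the "A"/"B" labels, and both toy judges are hypothetical.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface: (prompt, first_response, second_response) -> "A" or "B",
# where "A" means the first-shown response is preferred.
Judge = Callable[[str, str, str], str]


def positional_consistency(judge: Judge, pairs: Sequence[Tuple[str, str, str]]) -> float:
    """Fraction of pairs where the same response wins in both presentation orders."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    consistent = 0
    for prompt, r1, r2 in pairs:
        forward = judge(prompt, r1, r2)   # r1 is shown first
        backward = judge(prompt, r2, r1)  # r2 is shown first
        # The verdict label must flip when the order flips; otherwise the
        # judge is following position rather than content.
        if (forward, backward) in {("A", "B"), ("B", "A")}:
            consistent += 1
    return consistent / len(pairs)


def always_first(prompt: str, first: str, second: str) -> str:
    """Toy judge with maximal positional bias: always prefers the first slot."""
    return "A"


def longer_wins(prompt: str, first: str, second: str) -> str:
    """Toy judge that depends only on content (length), not on position."""
    return "A" if len(first) >= len(second) else "B"


pairs = [("q", "short", "a longer answer"), ("q", "xx", "y")]
biased_score = positional_consistency(always_first, pairs)
content_score = positional_consistency(longer_wins, pairs)
```

A judge that only follows position scores 0.0 under this metric, while a judge that depends only on content scores 1.0, so a gain in this kind of score can be read as reduced positional bias.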
Abstract
Reward Models (RMs) are critical components in the Reinforcement Learning from Human Feedback (RLHF) pipeline, directly determining the alignment quality of Large Language Models (LLMs). Recently, Generative Reward Models (GRMs) have emerged as a superior paradigm, offering higher interpretability and stronger generalization than traditional scalar RMs. However, existing methods for GRMs focus primarily on outcome-level supervision, neglecting analytical process quality, which constrains their potential. To address this, we propose ReflectRM, a novel GRM that leverages self-reflection to assess analytical quality and enhance preference modeling. ReflectRM is trained under a unified generative framework for joint modeling of response preference and analysis preference. During inference, we use its self-reflection capability to identify the most reliable analysis, from which the final…
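The inference step described above (using self-reflection to pick the most reliable analysis) could be sketched as follows. The abstract is truncated here, so this is only a guess at the general shape under stated assumptions: several candidate analyses are sampled, a hypothetical `prefer` function stands in for the model's analysis-preference judgment, a round-robin tally picks the winner, and the final verdict is taken from the winning analysis. None of these names or the tallying scheme come from the paper.

```python
from itertools import combinations
from typing import Callable, List, Tuple

# Hypothetical candidate: (analysis text, response verdict "A" or "B").
Candidate = Tuple[str, str]
# Hypothetical analysis-preference function: True if the first analysis is judged better.
Prefer = Callable[[str, str], bool]


def select_by_reflection(candidates: List[Candidate], prefer: Prefer) -> Candidate:
    """Pick the candidate whose analysis wins the most pairwise comparisons."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    wins = [0] * len(candidates)
    # Compare every unordered pair of analyses once and count wins.
    for i, j in combinations(range(len(candidates)), 2):
        if prefer(candidates[i][0], candidates[j][0]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Ties resolve to the earliest candidate, since max() keeps the first maximum.
    best = max(range(len(candidates)), key=wins.__getitem__)
    return candidates[best]


def longer_is_better(a: str, b: str) -> bool:
    """Toy stand-in for the model's self-reflection: prefers the longer analysis."""
    return len(a) >= len(b)


sampled = [
    ("brief", "B"),
    ("a detailed step by step comparison", "A"),
    ("medium length", "B"),
]
analysis, verdict = select_by_reflection(sampled, longer_is_better)
```

In a real system `prefer` would be a call to the reward model itself, and the number of sampled analyses would trade inference cost against reliability.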
