ReasonGRM: Enhancing Generative Reward Models through Large Reasoning Models
Bin Chen, Xinzge Gao, Chuanrui Hu, Penghang Yu, Hua Zhang, Bing-Kun Bao

TL;DR
ReasonGRM introduces a three-stage framework that improves generative reward models by enhancing reasoning quality, reducing hallucinations, and achieving state-of-the-art performance on benchmarks.
Contribution
The paper presents a three-stage training process for generative reward models that combines reasoning path generation, a new likelihood-based evaluation metric for those paths, and reinforcement learning.
Findings
Outperforms previous GRMs by 1.8% on average
Surpasses proprietary models like GPT-4o by up to 5.6%
Demonstrates the importance of reasoning-aware training
Abstract
Generative Reward Models (GRMs) provide greater flexibility than scalar reward models in capturing human preferences, but their effectiveness is limited by poor reasoning capabilities. This often results in incomplete or overly speculative reasoning paths, leading to hallucinations or missing key information in complex tasks. We address this challenge with ReasonGRM, a three-stage generative reward modeling framework. In the first stage, Zero-RL is used to generate concise, outcome-directed reasoning paths that reduce the likelihood of critical omissions. In the second stage, we introduce a novel evaluation metric that scores reasoning paths based on their generation likelihood. This favors paths that reach correct answers with minimal exploration, helping to reduce hallucination-prone data during training. In the final stage, the model is further refined through…
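The second stage described above, scoring reasoning paths by their generation likelihood and keeping the ones that reach the correct answer, can be illustrated with a minimal sketch. The abstract does not give the metric's definition, so the score used here (the geometric mean of token probabilities) is an assumed stand-in, not the paper's metric. The names `ReasoningPath`, `likelihood_score`, and `select_training_paths` are hypothetical and chosen only for this illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ReasoningPath:
    """One sampled reasoning path (hypothetical container, not from the paper)."""
    text: str
    token_logprobs: List[float]  # per-token log-probabilities, assumed available from the generator
    is_correct: bool             # whether the path's final verdict matches the reference label


def likelihood_score(path: ReasoningPath) -> float:
    """Geometric-mean token probability of a path.

    ASSUMPTION: this is a stand-in for the paper's likelihood-based metric,
    whose exact definition is not given in the abstract.
    """
    if not path.token_logprobs:
        return 0.0
    mean_logprob = sum(path.token_logprobs) / len(path.token_logprobs)
    return math.exp(mean_logprob)


def select_training_paths(paths: Sequence[ReasoningPath], top_k: int = 1) -> List[ReasoningPath]:
    """Keep only correct paths, then return the top_k with the highest score."""
    correct = [p for p in paths if p.is_correct]
    return sorted(correct, key=likelihood_score, reverse=True)[:top_k]


if __name__ == "__main__":
    candidates = [
        ReasoningPath("confident and correct", [-0.1, -0.2], True),
        ReasoningPath("hesitant but correct", [-1.0, -2.0], True),
        ReasoningPath("confident but wrong", [-0.01], False),
    ]
    for chosen in select_training_paths(candidates):
        print(chosen.text, round(likelihood_score(chosen), 4))
```

Filtering on correctness before ranking means a highly likely but wrong path is never selected. Averaging log-probabilities rather than summing them is a design choice of this sketch that avoids penalizing a path purely for its length; the paper may handle length differently.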
Taxonomy
Topics: Recommender Systems and Techniques · Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Games
