Better Language Model-Based Judging Reward Modeling through Scaling Comprehension Boundaries

Meiling Ning; Zhongbao Zhang; Junda Ye; Jiabao Guo; Qingyuan Guan

arXiv:2508.18212·cs.CL·November 18, 2025

TL;DR

This paper reframes LM-based judging reward modeling as a natural language inference (NLI) problem and argues that scaling a language model's comprehension boundaries yields more stable and generalizable reward signals.

Contribution

It proposes scaling language models' comprehension boundaries for reward modeling and introduces ESFP-RM, a two-stage LM-based judging reward model that uses masked language models (MLMs) with contextual explanations.

Findings

01. MLMs with contextual explanations outperform autoregressive models in NLI tasks.

02. ESFP-RM provides more stable reward signals in RLHF and OOD scenarios.

03. Scaling comprehension boundaries enhances reward model effectiveness.

Abstract

The emergence of LM-based judging reward modeling, represented by generative reward models, has made reinforcement learning from AI feedback (RLAIF) efficient and scalable. To further advance this paradigm, we propose a core insight: this form of reward modeling is formally consistent with natural language inference (NLI), a core task in natural language understanding. This reframed perspective points to a key path for building superior reward models: scaling the model's comprehension boundaries. Pursuing this path, exploratory experiments on NLI tasks show that slot-prediction masked language models (MLMs) incorporating contextual explanations perform significantly better than mainstream autoregressive models. Based on this finding, we propose ESFP-RM, a two-stage LM-based judging reward model that utilizes an explanation…
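To make the NLI framing concrete, the following is a minimal sketch of how a judging instance could be cast as a slot-prediction (cloze) problem and how a scalar reward could be read off the masked slot. It is not the authors' implementation: the abstract does not specify a prompt template, verbalizer words, or scoring rule, so the template, the `[MASK]` token, the words "correct" and "wrong", and the function names `build_cloze_prompt` and `reward_from_slot_logits` are all assumptions made for illustration. The logits are passed in as a plain dictionary in place of a real MLM forward pass.

```python
import math

# Hypothetical mask token; real MLMs define their own (assumption).
MASK = "[MASK]"


def build_cloze_prompt(question: str, answer: str, explanation: str) -> str:
    """Format a judging instance as an NLI-style cloze with one masked slot.

    The template is illustrative only; the source does not give one.
    """
    return (
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Explanation: {explanation}\n"
        f"The answer is {MASK}."
    )


def reward_from_slot_logits(
    slot_logits: dict[str, float],
    positive: str = "correct",  # hypothetical verbalizer word
    negative: str = "wrong",    # hypothetical verbalizer word
) -> float:
    """Return P(positive) under a softmax restricted to two verbalizer words."""
    pos = slot_logits[positive]
    neg = slot_logits[negative]
    # Subtract the max before exponentiating for numerical stability.
    m = max(pos, neg)
    e_pos = math.exp(pos - m)
    e_neg = math.exp(neg - m)
    return e_pos / (e_pos + e_neg)


if __name__ == "__main__":
    prompt = build_cloze_prompt(
        "What is 2 + 2?", "4", "Adding two and two gives four."
    )
    print(prompt)
    # Stand-in logits for the masked position (made-up values).
    print(reward_from_slot_logits({"correct": 2.0, "wrong": -1.0}))
```

In this sketch the reward is a probability in (0, 1) rather than a generated verdict, which is one plausible way a slot-prediction judge could yield a continuous signal; whether ESFP-RM scores the slot this way is not stated in the text above.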
