TL;DR
This paper reframes LM-based judging reward modeling as a natural language inference (NLI) problem and argues that better reward models come from scaling a language model's comprehension boundaries. The resulting reward model yields more stable and more generalizable reward signals.
Contribution
The paper proposes scaling a language model's comprehension boundaries as the path to better reward models and introduces ESFP-RM, a two-stage LM-based judging reward model built on masked language models (MLMs) that incorporate contextual explanations.
Findings
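Slot-prediction MLMs that incorporate contextual explanations outperform mainstream autoregressive models on NLI tasks.
ESFP-RM provides more stable reward signals in reinforcement learning from human feedback (RLHF) and in out-of-distribution (OOD) scenarios.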
Scaling comprehension boundaries enhances reward model effectiveness.
Abstract
The emergence of LM-based judging reward modeling, represented by generative reward models, has successfully made reinforcement learning from AI feedback (RLAIF) efficient and scalable. To further advance this paradigm, we propose a core insight: this form of reward modeling shares fundamental formal consistency with natural language inference (NLI), a core task in natural language understanding. This reframed perspective points to a key path for building superior reward models: scaling the model's comprehension boundaries. Pursuing this path, exploratory experiments on NLI tasks demonstrate that slot-prediction masked language models (MLMs) incorporating contextual explanations perform significantly better than mainstream autoregressive models. Based on this key finding, we propose ESFP-RM, a two-stage LM-based judging reward model that utilizes an explanation…
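To make the NLI framing concrete, the following is a minimal sketch of slot-prediction scoring, not the paper's implementation. It assumes a cloze-style template with a single [MASK] slot and a yes/no pair of label words. The template wording, the label words, the function names, and the example logits are all hypothetical, and the abstract is cut off before it describes the two stages of ESFP-RM. The idea shown is only the generic one: an MLM scores candidate label words at the masked slot, and the normalized score of the positive label serves as the reward.

```python
import math

# Hypothetical cloze template (an assumption, not taken from the paper).
# An MLM would be asked to fill the single [MASK] slot with a label word.
TEMPLATE = (
    "Prompt: {prompt}\n"
    "Response: {response}\n"
    "Explanation: {explanation}\n"
    "Question: Is the response acceptable? Answer: [MASK]"
)

# Hypothetical label words (verbalizer) for the masked slot.
POSITIVE, NEGATIVE = "yes", "no"


def build_cloze_input(prompt: str, response: str, explanation: str) -> str:
    """Format a (prompt, response, explanation) triple as a cloze-style input."""
    return TEMPLATE.format(prompt=prompt, response=response, explanation=explanation)


def slot_reward(slot_logits: dict) -> float:
    """Softmax over the two label-word logits; return P(positive label)."""
    pos = slot_logits[POSITIVE]
    neg = slot_logits[NEGATIVE]
    m = max(pos, neg)  # subtract the max for numerical stability
    e_pos = math.exp(pos - m)
    e_neg = math.exp(neg - m)
    return e_pos / (e_pos + e_neg)


text = build_cloze_input(
    "What is 2 + 2?",
    "4",
    "The response states the correct sum.",
)
# Made-up logits standing in for what an MLM would output at the [MASK] slot.
reward = slot_reward({"yes": 2.0, "no": 0.0})
```

In practice the logits would come from a real masked language model evaluated at the [MASK] position. Restricting the softmax to the label words keeps the reward in the interval (0, 1) regardless of the size of the model's vocabulary.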