Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems

Hao Peng; Yunjia Qi; Xiaozhi Wang; Zijun Yao; Bin Xu; Lei Hou; Juanzi Li

arXiv:2502.19328 · cs.CL · February 27, 2025

TL;DR

This paper introduces agentic reward modeling, which combines human preference rewards with verifiable correctness signals (factuality and instruction following) to produce more reliable rewards for large language models. The resulting system, RewardAgent, significantly outperforms vanilla reward models on existing reward model benchmarks and in inference-time best-of-n search.

Contribution

The paper proposes a novel reward system that integrates human preferences with verifiable correctness signals, enhancing reward reliability for LLM training.

Findings

01

RewardAgent significantly outperforms vanilla reward models on existing reward model benchmarks.

02

Training LLMs with RewardAgent improves NLP benchmark scores.

03

Combining human preference rewards with verifiable correctness signals yields more reliable rewards.

Abstract

Reward models (RMs) are crucial for the training and inference-time scaling up of large language models (LLMs). However, existing reward models primarily focus on human preferences, neglecting verifiable correctness signals which have shown strong potential in training LLMs. In this paper, we propose agentic reward modeling, a reward system that combines reward models with verifiable correctness signals from different aspects to provide reliable rewards. We empirically implement a reward agent, named RewardAgent, that combines human preference rewards with two verifiable signals: factuality and instruction following, to provide more reliable rewards. We conduct comprehensive experiments on existing reward model benchmarks and inference time best-of-n searches on real-world downstream tasks. RewardAgent significantly outperforms vanilla reward models, demonstrating its effectiveness. We…
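The idea in the abstract, a preference reward combined with verifiable checks and then used for best-of-n search, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the additive combination rule, the `verifier_weight` parameter, the class and function names, and the three toy scorers are all hypothetical stand-ins chosen for illustration. In a real system the preference scorer would be a trained reward model and the verifiers would be actual factuality and instruction-following checkers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A scorer maps (prompt, response) to a float reward. Hypothetical type alias.
Scorer = Callable[[str, str], float]


@dataclass
class RewardCombiner:
    """Adds verifier scores to a preference score (assumed combination rule)."""

    preference: Scorer
    verifiers: Dict[str, Scorer]
    verifier_weight: float = 1.0  # assumption: one shared weight for all verifiers

    def score(self, prompt: str, response: str) -> float:
        total = self.preference(prompt, response)
        for check in self.verifiers.values():
            total += self.verifier_weight * check(prompt, response)
        return total


def best_of_n(prompt: str, candidates: List[str], combiner: RewardCombiner) -> str:
    """Return the candidate with the highest combined reward."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    return max(candidates, key=lambda r: combiner.score(prompt, r))


# Toy stand-ins, for illustration only.
def toy_preference(prompt: str, response: str) -> float:
    # Mimics a length-biased preference model: longer answers score higher.
    return min(len(response) / 100.0, 1.0)


def toy_factuality(prompt: str, response: str) -> float:
    # Hard-coded fact check for the demo prompt below.
    return 1.0 if "Paris" in response else 0.0


def toy_instruction_following(prompt: str, response: str) -> float:
    # Checks the "at most five words" constraint from the demo prompt.
    return 1.0 if len(response.split()) <= 5 else 0.0


if __name__ == "__main__":
    prompt = "Name the capital of France in at most five words."
    candidates = [
        "The capital of France is certainly the beautiful city of Lyon, as everyone knows.",
        "Paris.",
    ]
    combiner = RewardCombiner(
        preference=toy_preference,
        verifiers={
            "factuality": toy_factuality,
            "instruction_following": toy_instruction_following,
        },
    )
    print(best_of_n(prompt, candidates, combiner))
```

In this toy setup the length-biased preference scorer alone would pick the long, incorrect answer, while the combined score selects the short answer that passes both checks. That contrast is the motivation the abstract gives for adding verifiable signals to preference-based rewards.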

Code & Models

Repositories

thu-keg/agentic-reward-modeling
Official

Datasets

THU-KEG/IFBench
dataset · 496 dl

Videos

Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Recommender Systems and Techniques

Methods: Direct Preference Optimization · Focus