DuaShepherd: Integrating Stepwise Correctness and Potential Rewards for Mathematical Reasoning

Yuanhao Wu; Juntong Song; Hanning Zhang; Tong Zhang; Cheng Niu

arXiv:2506.17533·cs.CL·June 24, 2025

TL;DR

DuaShepherd introduces a reward modeling framework that combines correctness and potential signals to improve mathematical reasoning in Large Language Models, achieving state-of-the-art results on MATH500 and ProcessBench.

Contribution

The paper presents a multi-head reward model, trained in a multi-task setup, that learns correctness and potential signals jointly to enhance LLMs' mathematical reasoning capabilities.

Findings

01. Outperforms models trained on single signals

02. Achieves state-of-the-art results on MATH500 and ProcessBench

03. Demonstrates benefits of combined reward signals

Abstract

In this paper, we propose DuaShepherd, a novel reward modeling framework that integrates two complementary reward signals, correctness and potential, to enhance the mathematical reasoning capabilities of Large Language Models (LLMs). While correctness-based signals emphasize the identification of stepwise errors, potential-based signals focus on the likelihood of reaching the correct final answer. We developed an automated pipeline for constructing a large-scale reward modeling dataset with both signals. A unified, multi-head architecture was explored to train the two reward models in a multi-task setup, demonstrating benefits from learning both correctness and potential in parallel. By combining these two signals into a compound probability, our model achieves consistent performance improvements across multiple benchmarks. Empirical evaluations on MATH500 and ProcessBench confirm that this…
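
The abstract describes combining the two signals into a compound probability but, as excerpted here, does not give the formula. The following is a minimal sketch of one plausible reading, assuming the compound score for a step is the product of its correctness and potential probabilities, and assuming a whole solution is scored by its weakest step. The names `StepScores`, `compound_score`, `solution_score`, and `best_of_n` are hypothetical and chosen for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class StepScores:
    """Hypothetical per-step outputs of a two-head reward model."""
    correctness: float  # assumed: probability that this step is correct
    potential: float    # assumed: probability of reaching the correct final answer


def compound_score(step: StepScores) -> float:
    """Combine both signals into one score.

    Assumption: the compound probability is the plain product of the two heads.
    """
    for name, value in (("correctness", step.correctness),
                        ("potential", step.potential)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return step.correctness * step.potential


def solution_score(steps: Sequence[StepScores]) -> float:
    """Score a full solution by its weakest step (min aggregation, an assumption)."""
    if not steps:
        raise ValueError("a solution needs at least one step")
    return min(compound_score(s) for s in steps)


def best_of_n(candidates: List[Sequence[StepScores]]) -> int:
    """Return the index of the candidate solution with the highest score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(range(len(candidates)), key=lambda i: solution_score(candidates[i]))


if __name__ == "__main__":
    # Candidate A is solid throughout; candidate B contains one weak step.
    cand_a = [StepScores(0.90, 0.80), StepScores(0.95, 0.70)]
    cand_b = [StepScores(0.99, 0.90), StepScores(0.40, 0.50)]
    print("selected candidate index:", best_of_n([cand_a, cand_b]))
```

Under these assumptions, a candidate with a single low-scoring step is ranked below one whose steps are all moderately strong, which is the kind of behavior a stepwise reward model is meant to provide in best-of-N selection. Other aggregations, such as the product over steps or the score of the last step, would slot into `solution_score` without changing the rest of the sketch.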
