VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data

Thomas Zeng; Shuibai Zhang; Shutong Wu; Christian Classen; Daewon Chae; Ethan Ewer; Minjae Lee; Heeju Kim; Wonjun Kang; Jackson Kunde; Ying Fan; Jungtaek Kim; Hyung Il Koo; Kannan Ramchandran; Dimitris Papailiopoulos; Kangwook Lee

arXiv:2502.06737 · cs.LG · June 30, 2025

TL;DR

VersaPRM is a multi-domain process reward model (PRM) trained on synthetic reasoning data. It gives large language models consistent performance gains across diverse domains, including non-mathematical ones where math-trained PRMs perform poorly.

Contribution

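The paper shows that current PRMs, which are trained predominantly on mathematical data, perform poorly in other domains. It then introduces VersaPRM, a multi-domain PRM trained on synthetic reasoning data produced by the authors' data generation and annotation method.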

Findings

01

VersaPRM outperforms both the majority voting baseline and the math-trained Qwen2.5-Math-PRM in non-mathematical domains such as Law.

02

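With weighted majority voting, VersaPRM achieves a 7.9% performance gain over the majority voting baseline in the MMLU-Pro Law category, compared with a 1.3% gain for Qwen2.5-Math-PRM.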

03

The data, code, and models are open-sourced for community use.

Abstract

Process Reward Models (PRMs) have proven effective at enhancing mathematical reasoning for Large Language Models (LLMs) by leveraging increased inference-time computation. However, they are predominantly trained on mathematical data, and their generalizability to non-mathematical domains has not been rigorously studied. In response, this work first shows that current PRMs perform poorly in other domains. To address this limitation, we introduce VersaPRM, a multi-domain PRM trained on synthetic reasoning data generated using our novel data generation and annotation method. VersaPRM achieves consistent performance gains across diverse domains. For instance, in the MMLU-Pro category of Law, VersaPRM, via weighted majority voting, achieves a 7.9% performance gain over the majority voting baseline, surpassing Qwen2.5-Math-PRM's gain of 1.3%. We further contribute to the community by…
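
The difference between majority voting and PRM-weighted majority voting, as mentioned in the abstract, can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the toy scores, and the choice to collapse per-step PRM scores into a single solution score with `min` are all assumptions made for illustration; the abstract does not specify the aggregation rule.

```python
from collections import defaultdict


def majority_vote(answers):
    """Return the final answer that appears most often among the samples."""
    counts = defaultdict(int)
    for answer in answers:
        counts[answer] += 1
    return max(counts, key=counts.get)


def weighted_majority_vote(answers, step_scores):
    """Return the answer with the largest total PRM weight.

    Each sampled solution contributes a weight derived from its per-step
    PRM scores. Using the minimum step score as the solution weight is an
    assumption of this sketch, not something stated in the abstract.
    """
    totals = defaultdict(float)
    for answer, scores in zip(answers, step_scores):
        totals[answer] += min(scores)
    return max(totals, key=totals.get)


# Hypothetical example: three sampled solutions to one multiple-choice question.
answers = ["A", "A", "B"]
step_scores = [
    [0.9, 0.2],   # solution 1: one weak step
    [0.8, 0.3],   # solution 2: one weak step
    [0.9, 0.95],  # solution 3: every step scored highly
]

plain_choice = majority_vote(answers)
weighted_choice = weighted_majority_vote(answers, step_scores)
```

In this toy case plain voting selects "A" because it appears twice, while the weighted vote selects "B" because its single solution has a higher score (0.9) than the two "A" solutions combined (0.2 + 0.3). This is the mechanism by which a PRM that scores steps well in a given domain can improve on the majority voting baseline.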

Taxonomy

Topics: Business Process Modeling and Analysis