ToolPRMBench: Evaluating and Advancing Process Reward Models for Tool-using Agents

Dawei Li; Yuguang Yao; Zhen Tan; Huan Liu; Ruocheng Guo

arXiv:2601.12294·cs.AI·January 21, 2026

Open Access

TL;DR

ToolPRMBench is a large-scale benchmark for evaluating process reward models (PRMs) for tool-using agents. It converts agent trajectories from existing tool-using benchmarks into step-level test cases, so PRM effectiveness can be assessed systematically.

Contribution

The paper introduces ToolPRMBench, the first large-scale, systematic benchmark for evaluating process reward models in tool-using agents, incorporating both offline and online testing methods.

Findings

01 Specialized PRMs outperform general models in tool-using tasks.

02 Multi-LLM verification reduces label noise and improves data quality.

03 PRM effectiveness differs significantly across models.

Abstract

Reward-guided search methods have demonstrated strong potential in enhancing tool-using agents by effectively guiding sampling and exploration over complex action spaces. As a core design, those search methods utilize process reward models (PRMs) to provide step-level rewards, enabling more fine-grained monitoring. However, there is a lack of systematic and reliable evaluation benchmarks for PRMs in tool-using settings. In this paper, we introduce ToolPRMBench, a large-scale benchmark specifically designed to evaluate PRMs for tool-using agents. ToolPRMBench is built on top of several representative tool-using benchmarks and converts agent trajectories into step-level test cases. Each case contains the interaction history, a correct action, a plausible but incorrect alternative, and relevant tool metadata. We respectively utilize offline sampling to isolate local single-step errors and…
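
The step-level test case described in the abstract can be sketched as a small data structure with a pairwise check. This is a minimal illustration, not the benchmark's actual schema or metric: the field names, `StepCase`, `pairwise_accuracy`, `pick_action`, the scoring signature, and the toy scorer are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence

# Hypothetical schema: field names are illustrative assumptions,
# not the benchmark's real format.
@dataclass
class StepCase:
    history: List[str]        # interaction history up to the current step
    correct_action: str       # the action labeled correct
    incorrect_action: str     # the plausible but incorrect alternative
    tool_metadata: Dict[str, str] = field(default_factory=dict)

# A PRM is modeled as any callable that scores one candidate action in context.
ScoreFn = Callable[[List[str], str, Dict[str, str]], float]

def pairwise_accuracy(cases: Sequence[StepCase], score: ScoreFn) -> float:
    """Fraction of cases where the correct action scores strictly higher.

    Assumed metric for illustration; ties count as failures.
    """
    if not cases:
        return 0.0
    wins = sum(
        score(c.history, c.correct_action, c.tool_metadata)
        > score(c.history, c.incorrect_action, c.tool_metadata)
        for c in cases
    )
    return wins / len(cases)

def pick_action(history: List[str], candidates: Sequence[str],
                metadata: Dict[str, str], score: ScoreFn) -> str:
    """Reward-guided selection: return the highest-scoring candidate action."""
    return max(candidates, key=lambda a: score(history, a, metadata))

def toy_score(history: List[str], action: str, metadata: Dict[str, str]) -> float:
    # Toy stand-in for a PRM: rewards actions whose tool name is a known tool.
    tool = action.split("(", 1)[0]
    return 1.0 if tool in metadata else 0.0

if __name__ == "__main__":
    tools = {"get_weather": "Look up current weather for a city"}
    case = StepCase(
        history=["user: What is the weather in Paris?"],
        correct_action="get_weather(city='Paris')",
        incorrect_action="get_wether(city='Paris')",  # misspelled tool name
        tool_metadata=tools,
    )
    print(pairwise_accuracy([case], toy_score))
    print(pick_action(case.history,
                      [case.incorrect_action, case.correct_action],
                      tools, toy_score))
```

Treating the PRM as a plain callable keeps the sketch independent of any particular model; a real PRM would replace `toy_score` with a learned step-level scorer.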

Taxonomy

Topics: Business Process Modeling and Analysis · Data Stream Mining Techniques · Explainable Artificial Intelligence (XAI)