ToolPRMBench: Evaluating and Advancing Process Reward Models for Tool-using Agents
Dawei Li, Yuguang Yao, Zhen Tan, Huan Liu, Ruocheng Guo

TL;DR
ToolPRMBench is a large-scale benchmark for evaluating process reward models (PRMs) for tool-using agents. It enables systematic assessment of PRM effectiveness and is intended to guide future improvements in tool-using AI systems.
Contribution
The paper introduces ToolPRMBench, the first large-scale, systematic benchmark for evaluating process reward models in tool-using agents, incorporating both offline and online testing methods.
Findings
Specialized PRMs outperform general models in tool-using tasks.
Multi-LLM verification reduces label noise and improves data quality.
PRM effectiveness differs significantly across models.
Abstract
Reward-guided search methods have demonstrated strong potential in enhancing tool-using agents by effectively guiding sampling and exploration over complex action spaces. As a core design, those search methods utilize process reward models (PRMs) to provide step-level rewards, enabling more fine-grained monitoring. However, there is a lack of systematic and reliable evaluation benchmarks for PRMs in tool-using settings. In this paper, we introduce ToolPRMBench, a large-scale benchmark specifically designed to evaluate PRMs for tool-using agents. ToolPRMBench is built on top of several representative tool-using benchmarks and converts agent trajectories into step-level test cases. Each case contains the interaction history, a correct action, a plausible but incorrect alternative, and relevant tool metadata. We respectively utilize offline sampling to isolate local single-step errors and…
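The step-level test case described in the abstract (interaction history, a correct action, a plausible but incorrect alternative, and tool metadata) could be represented roughly as follows. This is a minimal sketch, not the benchmark's actual schema or metric: the field names, the `pairwise_accuracy` function, and the `toy_score` stand-in for a PRM are all hypothetical, and it assumes a PRM is counted as correct on a case when it scores the correct action strictly higher than the incorrect one.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class StepCase:
    """One step-level test case. Field names are hypothetical."""
    history: List[str]             # interaction history up to this step
    correct_action: str            # the action that should be preferred
    incorrect_action: str          # plausible but incorrect alternative
    tool_metadata: Dict[str, str] = field(default_factory=dict)


# A scorer maps (case, candidate action) to a scalar step-level reward.
ScoreFn = Callable[[StepCase, str], float]


def pairwise_accuracy(cases: List[StepCase], score: ScoreFn) -> float:
    """Fraction of cases where the correct action gets a strictly higher score.

    Assumed metric for illustration; ties count as failures.
    """
    if not cases:
        return 0.0
    wins = sum(
        1 for c in cases
        if score(c, c.correct_action) > score(c, c.incorrect_action)
    )
    return wins / len(cases)


def toy_score(case: StepCase, action: str) -> float:
    """Stand-in for a PRM: rewards actions that call a known tool name."""
    tool_name = action.split("(", 1)[0]
    return 1.0 if tool_name in case.tool_metadata else 0.0


cases = [
    StepCase(
        history=["user: what is the weather in Paris?"],
        correct_action="get_weather(city='Paris')",
        incorrect_action="get_wether(city='Paris')",  # misspelled tool name
        tool_metadata={"get_weather": "Return current weather for a city."},
    ),
    StepCase(
        history=["user: book a table for two"],
        correct_action="book_table(size=2)",
        incorrect_action="book_table(size=20)",  # right tool, wrong argument
        tool_metadata={"book_table": "Reserve a restaurant table."},
    ),
]

accuracy = pairwise_accuracy(cases, toy_score)
print(f"pairwise accuracy: {accuracy:.2f}")
```

The toy scorer only checks the tool name, so it catches the misspelled call in the first case but ties on the wrong-argument error in the second. That gap is the kind of step-level weakness a benchmark of paired correct and incorrect actions is meant to expose.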
Taxonomy
Topics: Business Process Modeling and Analysis · Data Stream Mining Techniques · Explainable Artificial Intelligence (XAI)
