AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents
Shengda Fan, Xuyan Ye, Yupeng Huo, Zhi-Yuan Chen, Yiju Guo, Shenzhi Yang, Wenkai Yang, Shuqi Ye, Jingwen Chen, Haotian Chen, Xin Cong, Yankai Lin

TL;DR
AgentProcessBench is a new benchmark designed to evaluate step-level effectiveness of tool-using agents in realistic, open-ended scenarios, addressing the limitations of existing mathematical domain benchmarks.
Contribution
Introduces the first benchmark for step-level process quality in tool-using agents, with diverse trajectories, human annotations, and insights into model performance and challenges.
Findings
Weaker models tend to overestimate correctness due to early stopping.
Distinguishing neutral from erroneous actions remains difficult.
Process signals complement outcome supervision, improving scaling.
Abstract
While Large Language Models (LLMs) have evolved into tool-using agents, they remain brittle in long-horizon interactions. Unlike mathematical reasoning where errors are often rectifiable via backtracking, tool-use failures frequently induce irreversible side effects, making accurate step-level verification critical. However, existing process-level benchmarks are predominantly confined to closed-world mathematical domains, failing to capture the dynamic and open-ended nature of tool execution. To bridge this gap, we introduce AgentProcessBench, the first benchmark dedicated to evaluating step-level effectiveness in realistic, tool-augmented trajectories. The benchmark comprises 1,000 diverse trajectories and 8,509 human-labeled step annotations with 89.1% inter-annotator agreement. It features a ternary labeling scheme to capture exploration and an error propagation rule to reduce…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications
