AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

Shengda Fan; Xuyan Ye; Yupeng Huo; Zhi-Yuan Chen; Yiju Guo; Shenzhi Yang; Wenkai Yang; Shuqi Ye; Jingwen Chen; Haotian Chen; Xin Cong; Yankai Lin

arXiv:2603.14465·cs.AI·March 17, 2026

TL;DR

AgentProcessBench is a benchmark for evaluating the step-level effectiveness of tool-using agents in realistic, open-ended scenarios. It fills a gap left by existing process-level benchmarks, which are mostly confined to closed-world mathematical domains.

Contribution

Introduces the first benchmark for step-level process quality in tool-using agents: 1,000 diverse trajectories with 8,509 human-labeled step annotations, plus an analysis of model performance and open challenges.

Findings

01 Weaker models tend to overestimate correctness due to early stopping.

02 Distinguishing neutral from erroneous actions remains difficult.

03 Process signals complement outcome supervision, improving scaling.

Abstract

While Large Language Models (LLMs) have evolved into tool-using agents, they remain brittle in long-horizon interactions. Unlike mathematical reasoning where errors are often rectifiable via backtracking, tool-use failures frequently induce irreversible side effects, making accurate step-level verification critical. However, existing process-level benchmarks are predominantly confined to closed-world mathematical domains, failing to capture the dynamic and open-ended nature of tool execution. To bridge this gap, we introduce AgentProcessBench, the first benchmark dedicated to evaluating step-level effectiveness in realistic, tool-augmented trajectories. The benchmark comprises 1,000 diverse trajectories and 8,509 human-labeled step annotations with 89.1% inter-annotator agreement. It features a ternary labeling scheme to capture exploration and an error propagation rule to reduce…
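
To make the annotation figures above concrete, here is a minimal sketch of how raw percent agreement between two annotators over ternary step labels could be computed. The label names and values (`EFFECTIVE`, `NEUTRAL`, `ERRONEOUS`), the function name, and the toy data are assumptions for illustration only; the paper may encode labels differently and may use a different agreement measure.

```python
from enum import IntEnum
from typing import Sequence


class StepLabel(IntEnum):
    """Hypothetical encoding of a ternary step label (names are assumptions)."""
    ERRONEOUS = -1
    NEUTRAL = 0
    EFFECTIVE = 1


def percent_agreement(a: Sequence[StepLabel], b: Sequence[StepLabel]) -> float:
    """Return the fraction of steps on which two annotators chose the same label."""
    if len(a) != len(b):
        raise ValueError("annotators must label the same steps")
    if not a:
        raise ValueError("no steps to compare")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


# Toy trajectory of four steps; the annotators disagree on the second step.
annotator_1 = [StepLabel.EFFECTIVE, StepLabel.NEUTRAL,
               StepLabel.ERRONEOUS, StepLabel.EFFECTIVE]
annotator_2 = [StepLabel.EFFECTIVE, StepLabel.ERRONEOUS,
               StepLabel.ERRONEOUS, StepLabel.EFFECTIVE]

agreement = percent_agreement(annotator_1, annotator_2)
print(f"{agreement:.1%}")  # 3 of 4 steps match
```

Raw percent agreement does not correct for chance; a chance-corrected statistic such as Cohen's kappa would be a stricter alternative.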

Code & Models

Datasets

LulaCola/AgentProcessBench
dataset · 328 downloads

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications