ProcBench: Evaluating Process-Level Defects and Control Preservation in LLM Coding Agents

Jiawei He; Jie Jia; Chenbo Liu; Chaoyi Xue; Yapeng Song; Xikai Yang; Dong Sun

arXiv:2605.20251·cs.SE·May 22, 2026


TL;DR

ProcBench is a benchmark that evaluates the execution process of LLM coding agents rather than only their final outcomes. It flags process-level defects and lapses in control preservation that outcome-based metrics can miss.

Contribution

ProcBench introduces a process-level evaluation framework for LLM coding agents, comprising an ontology of defect types, a standardized trajectory representation built from raw logs, and control preservation metrics.

Findings

01

ProcBench reliably detects execution defects across diverse cases.

02

It provides stable semantics and reveals differences in execution quality.

03

ProcBench uncovers issues often missed by outcome-only evaluations.

Abstract

Existing benchmarks for LLM coding agents primarily evaluate final outcomes. While useful for measuring overall capability, these metrics provide limited visibility and often miss defects that arise during execution. We present ProcBench, a benchmark for execution-process evaluation in LLM coding agents. ProcBench organizes recurrent execution defects into a reusable ontology covering 11 defect types in 4 categories, and evaluates agent trajectories through standardized process evidence rather than final outcomes alone. To support comparison across heterogeneous agents, ProcBench standardizes raw logs into a unified trajectory representation and reports calibrated scorecards over process-level findings. In addition, ProcBench uses control preservation as a way to quantify execution-process quality, capturing whether execution remains interpretable, interruptible, correctable,…
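
To make the log-standardization idea concrete, here is a minimal sketch of how raw logs from different agents might be mapped into one trajectory format and then summarized per defect category. It is an illustration only, not ProcBench's actual schema or scoring: the names `Step`, `Finding`, `normalize_log`, `flag_repeats`, and `scorecard` are hypothetical, and the single detection rule and its labels are placeholders rather than any of the 11 defect types defined in the paper.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Step:
    """One normalized trajectory step (hypothetical schema)."""
    index: int
    actor: str
    action: str
    detail: str


@dataclass(frozen=True)
class Finding:
    """A process-level finding tied to a step (labels are placeholders)."""
    step_index: int
    category: str
    defect_type: str


def normalize_log(raw_events: list[dict[str, Any]],
                  field_map: dict[str, str]) -> list[Step]:
    """Map agent-specific field names onto the shared Step schema."""
    steps = []
    for i, event in enumerate(raw_events):
        steps.append(Step(
            index=i,
            actor=str(event.get(field_map["actor"], "unknown")),
            action=str(event.get(field_map["action"], "unknown")),
            detail=str(event.get(field_map["detail"], "")),
        ))
    return steps


def flag_repeats(steps: list[Step]) -> list[Finding]:
    """Toy rule: flag a step that exactly repeats the previous one."""
    findings = []
    for prev, cur in zip(steps, steps[1:]):
        if (prev.action, prev.detail) == (cur.action, cur.detail):
            findings.append(Finding(cur.index, "placeholder-category",
                                    "placeholder-repeat"))
    return findings


def scorecard(findings: list[Finding]) -> dict[str, int]:
    """Count findings per category (uncalibrated raw counts)."""
    return dict(Counter(f.category for f in findings))


# Two agents that log the same information under different field names.
agent_a_log = [
    {"role": "agent", "tool": "run_tests", "args": "pytest"},
    {"role": "agent", "tool": "run_tests", "args": "pytest"},
]
agent_b_log = [
    {"who": "agent", "cmd": "edit_file", "payload": "main.py"},
    {"who": "agent", "cmd": "run_tests", "payload": "pytest"},
]
steps_a = normalize_log(agent_a_log,
                        {"actor": "role", "action": "tool", "detail": "args"})
steps_b = normalize_log(agent_b_log,
                        {"actor": "who", "action": "cmd", "detail": "payload"})
card_a = scorecard(flag_repeats(steps_a))
card_b = scorecard(flag_repeats(steps_b))
```

Because both logs end up in the same `Step` form, one detection rule can run over either agent's trajectory without agent-specific code, which is the property a unified representation is meant to provide. The counts here are raw; the calibration the abstract mentions is not modeled.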
