ProcBench: Evaluating Process-Level Defects and Control Preservation in LLM Coding Agents
Jiawei He, Jie Jia, Chenbo Liu, Chaoyi Xue, Yapeng Song, Xikai Yang, Dong Sun

TL;DR
ProcBench is a benchmark that evaluates the execution process of LLM coding agents rather than final outcomes alone. It identifies process-level defects and measures control preservation, giving a more complete assessment than outcome-based metrics.
Contribution
It introduces a process-level evaluation framework with an ontology of defect types, standardized logs, and control preservation metrics for LLM coding agents.
Findings
ProcBench reliably detects execution defects across diverse cases.
It provides stable semantics and reveals differences in execution quality.
ProcBench uncovers issues often missed by outcome-only evaluations.
Abstract
Existing benchmarks for LLM coding agents primarily evaluate final outcomes. While useful for measuring overall capability, these metrics provide limited visibility and often miss defects that arise during execution. We present ProcBench, a benchmark for execution-process evaluation in LLM coding agents. ProcBench organizes recurrent execution defects into a reusable ontology covering 11 defect types in 4 categories, and evaluates agent trajectories through standardized process evidence rather than final outcomes alone. To support comparison across heterogeneous agents, ProcBench standardizes raw logs into a unified trajectory representation and reports calibrated scorecards over process-level findings. In addition, ProcBench uses control preservation as a way to quantify execution-process quality, capturing whether execution remains interpretable, interruptible, correctable,…
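To make the log-standardization idea concrete, here is a minimal sketch of mapping two raw log formats onto one trajectory schema and running a single process-level check over it. Everything in it is an assumption made for illustration: the `Step` schema, the two raw formats, and the `repeated_failed_command` check are hypothetical and are not taken from ProcBench's ontology or tooling.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Step:
    """One normalized trajectory step (hypothetical schema, not ProcBench's)."""
    index: int
    kind: str    # "command" or "edit"
    target: str  # command line or file path
    ok: bool     # whether the step reported success


def normalize(raw_events: list[dict[str, Any]]) -> list[Step]:
    """Map two assumed raw log formats onto the shared Step schema."""
    steps: list[Step] = []
    for i, ev in enumerate(raw_events):
        if "tool" in ev:
            # Assumed format A: {"tool": "bash"|"edit", "arg": str, "exit": int}
            kind = "command" if ev["tool"] == "bash" else "edit"
            steps.append(Step(i, kind, ev["arg"], ev["exit"] == 0))
        else:
            # Assumed format B: {"type": "shell"|"write", "input": str, "success": bool}
            kind = "command" if ev["type"] == "shell" else "edit"
            steps.append(Step(i, kind, ev["input"], ev["success"]))
    return steps


def repeated_failed_commands(steps: list[Step], min_repeats: int = 2) -> list[dict[str, Any]]:
    """Flag back-to-back retries of the same failing command (illustrative check)."""
    findings: list[dict[str, Any]] = []
    run_start = 0
    for i in range(1, len(steps) + 1):
        # The current run continues only while each step is the same failing command.
        same = (
            i < len(steps)
            and steps[i].kind == "command"
            and steps[run_start].kind == "command"
            and steps[i].target == steps[run_start].target
            and not steps[i].ok
            and not steps[run_start].ok
        )
        if not same:
            first = steps[run_start]
            if i - run_start >= min_repeats and first.kind == "command" and not first.ok:
                findings.append({
                    "defect": "repeated_failed_command",
                    "evidence": [s.index for s in steps[run_start:i]],
                })
            run_start = i
    return findings
```

Because both raw formats normalize to the same `Step` records, the check runs unchanged on either agent's logs, and each finding points back to the step indices that serve as its evidence.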
