DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents

Sirui Hong; Zhijie Liu; Tengfei Li; Wei Tao; Yifan Wu; Chenglin Wu

arXiv:2605.17439·cs.SE·May 20, 2026

TL;DR

DiagEval is a trajectory-conditioned diagnostic protocol for GUI-agent evaluation. After a failed rollout, it attributes the failure to either a genuine software defect or an evaluator-side execution error, and it outperforms retry-based methods.

Contribution

The paper presents DiagEval, a diagnostic evaluation protocol that reuses failed trajectories to attribute failures more reliably, which improves GUI-agent evaluation accuracy.

Findings

01  Recovers 45.6-62.1% of initially misattributed failures.

02  Improves evaluation accuracy from 69.9% to 78.3% on WebDevJudge-Unit.

03  Achieves 34.4-160.6% relative gains over retry-based baselines.
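
To make the difference between absolute and relative improvement concrete, the short sketch below works through the WebDevJudge-Unit accuracy figures from the findings. The helper names are hypothetical, and the relative-gain formula is the standard definition; this summary does not state which metric or formula the paper uses for the 34.4-160.6% range, so that range is not recomputed here.

```python
def absolute_gain(before: float, after: float) -> float:
    """Difference in percentage points."""
    return after - before


def relative_gain(before: float, after: float) -> float:
    """Improvement as a percentage of the starting value (assumed standard definition)."""
    return (after - before) / before * 100


# Accuracy figures reported for WebDevJudge-Unit.
before, after = 69.9, 78.3

abs_points = round(absolute_gain(before, after), 1)   # percentage points
rel_percent = round(relative_gain(before, after), 1)  # percent of the baseline

print(f"absolute: +{abs_points} points, relative: +{rel_percent}%")
```

Under these definitions, the move from 69.9% to 78.3% is 8.4 percentage points, or roughly a 12% relative improvement.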

Abstract

Evaluating LLM-generated interactive software requires execution in addition to static analysis. The key difficulty is that correctness is a graph-level reachable property over latent UI state-transition graphs, whereas a GUI evaluator observes only a single execution trajectory. A failed rollout therefore rules out only one realized path, leaving failure attribution ambiguous between evaluator-side execution error and genuine software defect. We present DiagEval, a trajectory-conditioned diagnostic evaluation protocol for post-failure GUI-agent evaluation of interactive software. Rather than blindly retrying from scratch, DiagEval reuses the failed trajectory to choose targeted diagnostic probes and aggregates their outcomes into an internal attribution signal. The latent-graph view motivates the diagnostic problem; DiagEval does not reconstruct the graph or estimate calibrated…
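
The idea of reusing a failed trajectory to pick targeted probes can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the explicit state-transition dictionary only simulates an application (the abstract states that DiagEval does not reconstruct the graph), the probe rule (swap one action of the failed trajectory) is hypothetical, and the aggregation rule (any successful probe means evaluator error) is a deliberately simple stand-in for the internal attribution signal.

```python
from typing import Dict, List, Tuple

# Toy simulated app: state -> {action: next_state}. Hypothetical; a real
# GUI evaluator only observes trajectories, never this graph.
App = Dict[str, Dict[str, str]]


def run(app: App, actions: List[str], start: str = "home") -> str:
    """Execute actions from the start state; stop at the first unavailable action."""
    state = start
    for action in actions:
        nxt = app.get(state, {}).get(action)
        if nxt is None:
            break
        state = nxt
    return state


def diagnostic_probes(failed: List[str], candidates: List[str]) -> List[List[str]]:
    """Build probes that reuse the failed trajectory but swap exactly one action."""
    probes = []
    for i, original in enumerate(failed):
        for alt in candidates:
            if alt != original:
                probes.append(failed[:i] + [alt] + failed[i + 1:])
    return probes


def attribute(app: App, failed: List[str], candidates: List[str],
              goal: str) -> Tuple[str, int, int]:
    """Aggregate probe outcomes into a label (assumed rule: any success => evaluator error)."""
    probes = diagnostic_probes(failed, candidates)
    successes = sum(run(app, probe) == goal for probe in probes)
    label = "evaluator_error" if successes > 0 else "software_defect"
    return label, successes, len(probes)


# The evaluator opened the wrong panel, so its rollout never reached "saved".
FAILED = ["open_settings", "click_save"]
CANDIDATES = ["open_menu", "open_settings", "click_save"]

WORKING: App = {
    "home": {"open_menu": "menu", "open_settings": "settings"},
    "menu": {"click_save": "saved"},
    "settings": {},
}
# Same app, but the save button is missing: a genuine defect.
BUGGY: App = {
    "home": {"open_menu": "menu", "open_settings": "settings"},
    "menu": {},
    "settings": {},
}

print(attribute(WORKING, FAILED, CANDIDATES, "saved"))
print(attribute(BUGGY, FAILED, CANDIDATES, "saved"))
```

In this toy setting the same failed rollout is attributed differently depending on what the probes find: one probe reaches the goal in the working app, while none does in the buggy one. A blind retry from scratch would not use the failed path to decide where to look.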

Code & Models

Repositories

scutGit/DiagEval (GitHub)