TL;DR
DiagEval is a trajectory-conditioned diagnostic protocol for GUI-agent evaluation. After a failed rollout, it reuses the failed trajectory to decide whether the failure stems from a genuine software defect or from an evaluator-side execution error, and it outperforms retry-based methods.
Contribution
The paper presents DiagEval, a diagnostic evaluation protocol that reuses failed trajectories to choose targeted probes and attribute failures, improving GUI-agent evaluation accuracy over blind retries.
Findings
DiagEval recovers 45.6-62.1% of initially misattributed failures.
Evaluation accuracy on WebDevJudge-Unit improves from 69.9% to 78.3%.
Relative gains over retry-based baselines range from 34.4% to 160.6%.
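For scale, the WebDevJudge-Unit accuracy change can be read in both absolute and relative terms. The sketch below is plain arithmetic on the two accuracy figures quoted above; the function names are illustrative. The 34.4-160.6% range is measured against retry-based baselines, and this summary does not say on which metric, so it is not recomputed here.

```python
def absolute_gain(before: float, after: float) -> float:
    """Difference in percentage points."""
    return after - before


def relative_gain(before: float, after: float) -> float:
    """Improvement as a percentage of the starting value."""
    return (after - before) / before * 100.0


# Accuracy figures quoted for WebDevJudge-Unit.
before, after = 69.9, 78.3
print(f"absolute: {absolute_gain(before, after):.1f} points")  # absolute: 8.4 points
print(f"relative: {relative_gain(before, after):.1f}%")        # relative: 12.0%
```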
Abstract
Evaluating LLM-generated interactive software requires execution in addition to static analysis. The key difficulty is that correctness is a graph-level reachable property over latent UI state-transition graphs, whereas a GUI evaluator observes only a single execution trajectory. A failed rollout therefore rules out only one realized path, leaving failure attribution ambiguous between evaluator-side execution error and genuine software defect. We present DiagEval, a trajectory-conditioned diagnostic evaluation protocol for post-failure GUI-agent evaluation of interactive software. Rather than blindly retrying from scratch, DiagEval reuses the failed trajectory to choose targeted diagnostic probes and aggregates their outcomes into an internal attribution signal. The latent-graph view motivates the diagnostic problem; DiagEval does not reconstruct the graph or estimate calibrated…
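The post-failure flow described in the abstract (reuse the failed trajectory, pick targeted probes, aggregate their outcomes into an attribution signal) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Step` and `Probe` types, the `probe_bank` lookup, and the "any probe succeeds" aggregation rule are hypothetical and are not taken from the paper. In keeping with the abstract, the sketch neither reconstructs the UI graph nor produces a probability.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    action: str  # hypothetical action label, e.g. "click #submit"
    ok: bool     # whether this step succeeded during the rollout


@dataclass
class Probe:
    name: str
    run: Callable[[], bool]  # True if the probe reaches the expected behaviour


def choose_probes(trajectory: List[Step],
                  probe_bank: Dict[str, List[Probe]]) -> List[Probe]:
    """Pick probes targeted at the first failed step of the trajectory."""
    failed = next((s for s in trajectory if not s.ok), None)
    if failed is None:
        return []
    return probe_bank.get(failed.action, [])


def attribute(trajectory: List[Step],
              probe_bank: Dict[str, List[Probe]]) -> str:
    """Aggregate probe outcomes into a coarse label (illustrative rule only)."""
    outcomes = [p.run() for p in choose_probes(trajectory, probe_bank)]
    if not outcomes:
        return "undetermined"
    # If any alternative path works, the original failure is more plausibly
    # an evaluator-side execution error than a defect in the software.
    return "evaluator_error" if any(outcomes) else "software_defect"
```

The point of the design is that probes are selected from the failed step rather than by rerunning the whole task from scratch, which is the contrast with blind retrying that the abstract draws.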
