GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis
Yuwen Zhai, Runze Li, Liang Wang, Nian Shi, Liwu Xu, Wei Zhang, Ran Lin, Bo Xu, Benlei Cui

TL;DR
GUIDE is a hierarchical evaluation framework for GUI agents that decomposes long, complex trajectories into meaningful segments for more accurate, interpretable assessment and diagnostics.
Contribution
The paper introduces GUIDE, a novel hierarchical evaluation method that improves accuracy and interpretability by analyzing GUI agent trajectories in structured subtask segments.
Findings
GUIDE outperforms existing evaluators by up to 5.35 percentage points in accuracy.
It provides structured diagnostic reports with actionable insights.
GUIDE is validated on three diverse benchmarks with large datasets.
Abstract
Evaluating GUI agents presents a distinct challenge: trajectories are long, visually grounded, and open-ended, yet evaluation must be both accurate and interpretable. Existing approaches typically apply a single holistic judgment over the entire action-observation sequence, a strategy that proves unreliable on long-horizon tasks and yields binary verdicts offering no insight into where or why an agent fails. This opacity limits the utility of evaluation as a diagnostic tool for agent development. We introduce GUIDE (GUI Understanding and Interpretable Diagnostic Evaluation), a framework that decomposes trajectory assessment into three sequential stages mirroring the compositional structure of GUI tasks. Trajectory Segmentation partitions the full trace into semantically coherent subtask units. Subtask Diagnosis evaluates each unit in context, assigning a completion verdict and generating…
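To make the first two named stages concrete, here is a minimal sketch and not the authors' implementation. Every name in it (`Step`, `Segment`, `Diagnosis`, `segment_trajectory`, `diagnose`, `toy_judge`) is hypothetical. It assumes each step already carries a subtask label and uses a trivial rule-based judge; the paper's own segmentation and diagnosis are presumably model-driven. The third stage is cut off in the excerpt above, so it is left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    action: str        # e.g. "click(search)"
    observation: str   # textual stand-in for a screenshot
    subtask: str       # ASSUMPTION: pre-labeled; a real segmenter would infer this


@dataclass
class Segment:
    subtask: str
    steps: List[Step]


@dataclass
class Diagnosis:
    subtask: str
    completed: bool
    note: str


def segment_trajectory(steps: List[Step]) -> List[Segment]:
    """Group consecutive steps that share a subtask label into one segment."""
    segments: List[Segment] = []
    for step in steps:
        if segments and segments[-1].subtask == step.subtask:
            segments[-1].steps.append(step)
        else:
            segments.append(Segment(step.subtask, [step]))
    return segments


Judge = Callable[[Segment, List[Diagnosis]], Tuple[bool, str]]


def diagnose(segments: List[Segment], judge: Judge) -> List[Diagnosis]:
    """Evaluate each segment in order, passing earlier diagnoses as context."""
    reports: List[Diagnosis] = []
    for seg in segments:
        completed, note = judge(seg, list(reports))
        reports.append(Diagnosis(seg.subtask, completed, note))
    return reports


def toy_judge(segment: Segment, context: List[Diagnosis]) -> Tuple[bool, str]:
    """Hypothetical stand-in judge: fail if the last observation mentions an error."""
    last = segment.steps[-1]
    if "error" in last.observation.lower():
        return False, f"failed at: {last.action}"
    return True, "final observation shows no error"


trajectory = [
    Step("open(app)", "home screen", "launch app"),
    Step("click(search)", "search box focused", "search item"),
    Step("type('shoes')", "Error: no network", "search item"),
]
segments = segment_trajectory(trajectory)
reports = diagnose(segments, toy_judge)
for r in reports:
    print(r.subtask, r.completed, r.note)
```

The per-segment reports show where a trajectory breaks down, which a single pass/fail verdict over the whole trace cannot do.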
