D-GARA: A Dynamic Benchmarking Framework for GUI Agent Robustness in Real-World Anomalies
Sen Chen, Tong Zhao, Yi Bin, Fei Ma, Wenqi Shao, Zheng Wang

TL;DR
D-GARA is a dynamic benchmarking framework designed to evaluate Android GUI agent robustness against real-world anomalies, revealing significant performance drops and emphasizing the need for robustness-aware learning.
Contribution
We introduce D-GARA, a novel extensible framework and benchmark for testing GUI agent robustness in realistic anomaly scenarios, addressing limitations of static, idealized datasets.
Findings
State-of-the-art GUI agents show significant performance degradation when anomalies are present.
D-GARA supports diverse, real-world anomaly types for comprehensive evaluation.
The framework's modular design allows easy extension and customization.
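To make the findings above concrete, here is a minimal Python sketch of how an extensible anomaly registry and a robustness measurement could fit together. It is an illustration under stated assumptions, not D-GARA's actual API: the names `ANOMALIES`, `register_anomaly`, `success_rate`, and `robustness_drop`, and the dictionary-based screen state, are all hypothetical. The anomaly types are the interruption examples named in the abstract.

```python
from typing import Callable, Dict, List

# Hypothetical registry (not D-GARA's API): maps an anomaly name to a
# function that perturbs a simplified screen state.
ANOMALIES: Dict[str, Callable[[dict], dict]] = {}


def register_anomaly(name: str):
    """Decorator that adds an anomaly injector to the registry under `name`."""
    def decorator(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        ANOMALIES[name] = fn
        return fn
    return decorator


@register_anomaly("permission_dialog")
def permission_dialog(state: dict) -> dict:
    # Assumed representation: an interruption is modeled as an overlay key.
    return {**state, "overlay": "permission_dialog"}


@register_anomaly("battery_warning")
def battery_warning(state: dict) -> dict:
    return {**state, "overlay": "battery_warning"}


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of tasks completed successfully; 0.0 for an empty list."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def robustness_drop(clean: List[bool], anomalous: List[bool]) -> float:
    """Success rate without anomalies minus success rate with anomalies."""
    return success_rate(clean) - success_rate(anomalous)
```

In this sketch, adding a new anomaly type only requires defining one more decorated function, which is one way a modular design can support extension. A positive `robustness_drop` would indicate that the agent completes fewer tasks once anomalies are injected.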
Abstract
Developing intelligent agents capable of operating a wide range of Graphical User Interfaces (GUIs) with human-level proficiency is a key milestone on the path toward Artificial General Intelligence. However, most existing datasets and benchmarks for training and evaluating GUI agents are static and idealized, and fail to reflect the complexity and unpredictability of real-world environments, particularly the presence of anomalies. To bridge this research gap, we propose D-GARA, a dynamic benchmarking framework for evaluating the robustness of Android GUI agents to real-world anomalies. D-GARA introduces a diverse set of real-world anomalies that GUI agents commonly face in practice, including interruptions such as permission dialogs, battery warnings, and update prompts. Based on the D-GARA framework, we construct and annotate a benchmark featuring commonly used Android applications with embedded…
Taxonomy
Topics: Context-Aware Activity Recognition Systems · Anomaly Detection Techniques and Applications · Software System Performance and Reliability
