AndroidControl-Curated: Revealing the True Potential of GUI Agents through Benchmark Purification
Ho Fai Leung, Xiaoyan Xi, Fei Zuo

TL;DR
This paper improves the AndroidControl benchmark through careful curation, revealing that GUI agents are more capable than previously indicated, and introduces a new, efficient SOTA model that performs comparably to larger models.
Contribution
The authors enhanced the AndroidControl benchmark via a rigorous purification process and developed a compact, high-performing model, demonstrating GUI agents' true potential for practical deployment.
Findings
Enhanced benchmark shows 75% success rate for state-of-the-art models
A small, cost-effective model achieves performance comparable to larger models
Benchmark refinement reveals GUI agents are closer to real-world viability
Abstract
On-device virtual assistants like Siri and Google Assistant are increasingly pivotal, yet their capabilities are hamstrung by a reliance on rigid, developer-dependent APIs. GUI agents offer a powerful, API-independent alternative, but their adoption is hindered by the perception of poor performance: even the best models (e.g., Qwen3-VL-235B) are capped at around 60% on benchmarks like AndroidControl, far from viable for real-world use. Our research reveals that the issue lies not only with the models but with the benchmarks themselves. We identified notable shortcomings in AndroidControl, including ambiguities and factual errors, which systematically underrate agent capabilities. To address this critical oversight, we enhanced AndroidControl into AndroidControl-Curated, a refined version of the benchmark improved through a rigorous purification pipeline. On this enhanced…
Taxonomy
Topics: Spreadsheets and End-User Computing · Software System Performance and Reliability · Advanced Malware Detection Techniques
