Toward Autonomous UI Exploration: The UIExplorer Benchmark

Andrei Cristian Nica; Akshaya Vishnu Kudlu Shanbhogue; Harshil Shah; Aleix Cambray; Tudor Berariu; Lucas Maystre; David Barber

arXiv:2506.17779·cs.LG·June 24, 2025

TL;DR

This paper introduces UIExplore-Bench, a benchmark for evaluating autonomous agents' ability to explore user interfaces, providing standardized tasks, metrics, and results that highlight current limitations and future research directions.

Contribution

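We present UIExplore-Bench, the first benchmark dedicated to UI exploration, formalize exploration through the human-normalized UI-Functionalities Observed (hUFO) metric, and evaluate agents in both Structured and Screen modes, laying a foundation for future work on autonomous UI understanding.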

Findings

01 UIExplore-AlGo achieves up to 77.2% of human performance in Structured mode.

02 Agents perform significantly below human experts, indicating room for improvement.

03 Benchmark and dataset are publicly released to foster further research.

Abstract

Autonomous agents must know how to explore user interfaces (UIs) for reliable task solving, yet systematic evaluation of this crucial phase is lacking. We introduce UIExplore-Bench, the first benchmark explicitly dedicated to UI exploration. The benchmark evaluates agents with either Structured mode (granting access to layout information like DOM trees) or Screen mode (relying on GUI-only observations such as screenshots and human-like mouse/keyboard interactions) across three levels in a standardized GitLab sandbox environment. We formalize exploration as the process of maximizing the set of actionable UI components discovered and propose a metric, human-normalized UI-Functionalities Observed (hUFO), to quantify the effectiveness of exploration. Our results show that UIExplore-AlGo achieves the leading mean hUFO scores, reaching up to 77.2% of human performance in Structured mode and…
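The abstract describes exploration as maximizing the set of actionable UI components discovered, scored relative to humans. A minimal sketch of that idea follows, assuming the score is the number of distinct components an agent observes divided by the number a human reference observes. The paper's exact hUFO definition is not given on this page, so the function names `ufo` and `hufo`, the component identifiers, and the normalization itself are illustrative assumptions rather than the authors' implementation.

```python
from typing import Hashable, Iterable


def ufo(observations: Iterable[Iterable[Hashable]]) -> int:
    """Count distinct actionable UI components seen across all steps.

    Each element of `observations` is the collection of component IDs
    visible at one step of an exploration trajectory (hypothetical format).
    """
    seen = set()
    for step_components in observations:
        seen.update(step_components)
    return len(seen)


def hufo(agent_obs, human_obs) -> float:
    """Assumed human-normalized score: agent count over human count."""
    human_count = ufo(human_obs)
    if human_count == 0:
        raise ValueError("human reference trajectory discovered no components")
    return ufo(agent_obs) / human_count


# Made-up component IDs, for illustration only.
agent_steps = [
    {"nav:projects", "btn:new_issue"},
    {"btn:new_issue", "tab:merge_requests"},
]
human_steps = [
    {"nav:projects", "btn:new_issue", "tab:merge_requests"},
    {"btn:settings"},
]

score = hufo(agent_steps, human_steps)
print(f"hUFO (assumed definition): {score:.2f}")
```

Under this assumed definition, revisiting a component adds nothing to the score, so an agent is rewarded only for reaching functionality it has not seen before.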
