TL;DR
VenusBench-Mobile is a new, challenging benchmark for mobile GUI agents that emphasizes real-world, user-centric tasks and detailed behavior analysis, exposing current agents' limitations in perception, memory, and robustness.
Contribution
It introduces a realistic, user-driven evaluation framework with capability diagnostics, revealing significant performance gaps and failure modes of existing mobile GUI agents.
Findings
State-of-the-art agents perform poorly on VenusBench-Mobile.
Failures are mainly due to perception and memory deficiencies.
Agents are highly brittle under environment variations.
Abstract
Existing online benchmarks for mobile GUI agents remain largely app-centric and task-homogeneous, failing to reflect the diversity and instability of real-world mobile usage. To address this gap, we introduce VenusBench-Mobile, a challenging online benchmark for evaluating general-purpose mobile GUI agents under realistic, user-centric conditions. VenusBench-Mobile rests on two core evaluation pillars: what to evaluate, defined through user-intent-driven task design that reflects real mobile usage, and how to evaluate, defined through a capability-oriented annotation scheme for fine-grained agent behavior analysis. Extensive evaluation of state-of-the-art mobile GUI agents reveals large performance gaps relative to prior benchmarks, indicating that VenusBench-Mobile poses substantially more challenging and realistic tasks and that current agents remain far from reliable real-world deployment. Diagnostic…
