VenusBench-Mobile: A Challenging and User-Centric Benchmark for Mobile GUI Agents with Capability Diagnostics

Yichen Gong; Zhuohan Cai; Sunhao Dai; Yuqi Zhou; Zhangxuan Gu; Changhua Meng; Shuheng Shen

arXiv:2604.06182 · cs.HC · April 9, 2026

TL;DR

VenusBench-Mobile is a new, challenging benchmark for mobile GUI agents that emphasizes real-world, user-centric tasks and detailed behavior analysis, exposing current agents' limitations in perception, memory, and robustness.

Contribution

It introduces a realistic, user-driven evaluation framework with capability diagnostics, revealing significant performance gaps and failure modes of existing mobile GUI agents.

Findings

1. State-of-the-art agents perform poorly on VenusBench-Mobile.

2. Failures are mainly due to perception and memory deficiencies.

3. Agents are highly brittle under environment variations.

Abstract

Existing online benchmarks for mobile GUI agents remain largely app-centric and task-homogeneous, failing to reflect the diversity and instability of real-world mobile usage. To this end, we introduce VenusBench-Mobile, a challenging online benchmark for evaluating general-purpose mobile GUI agents under realistic, user-centric conditions. VenusBench-Mobile builds two core evaluation pillars: defining what to evaluate via user-intent-driven task design that reflects real mobile usage, and how to evaluate through a capability-oriented annotation scheme for fine-grained agent behavior analysis. Extensive evaluation of state-of-the-art mobile GUI agents reveals large performance gaps relative to prior benchmarks, indicating that VenusBench-Mobile poses substantially more challenging and realistic tasks and that current agents remain far from reliable real-world deployment. Diagnostic…
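
As a rough illustration of what a capability-oriented annotation makes possible, the sketch below tags each task with the capabilities it exercises and then reports a success rate per capability. Everything here is an assumption for illustration: the record layout, the function name `success_rate_by_capability`, the tag names (borrowed from the perception, memory, and robustness themes in the summary above), and the toy outcomes, which are not results from the paper.

```python
from collections import defaultdict

# Hypothetical episode records (not data from the paper): each task lists the
# capabilities it exercises and whether the agent completed it.
episodes = [
    {"task": "t1", "capabilities": ["perception"], "success": False},
    {"task": "t2", "capabilities": ["perception", "memory"], "success": False},
    {"task": "t3", "capabilities": ["memory"], "success": True},
    {"task": "t4", "capabilities": ["robustness"], "success": True},
]


def success_rate_by_capability(records):
    """Return {capability: fraction of tagged tasks that succeeded}."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for record in records:
        # A task with several tags counts toward each of its capabilities.
        for capability in record["capabilities"]:
            totals[capability] += 1
            passed[capability] += int(record["success"])
    return {cap: passed[cap] / totals[cap] for cap in totals}


rates = success_rate_by_capability(episodes)
for capability, rate in sorted(rates.items()):
    print(f"{capability}: {rate:.2f}")
```

Breaking results down this way shows which capability drags overall success down, rather than reporting a single aggregate score. The benchmark's actual annotation format and metrics may differ from this sketch.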

Code & Models

Repositories

inclusionAI/UI-Venus/tree/VenusBench-Mobile (GitHub)
