Mobile-Bench-v2: A More Realistic and Comprehensive Benchmark for VLM-based Mobile Agents

Weikai Xu; Zhizheng Jiang; Yuxuan Liu; Pengzhi Gao; Wei Liu; Jian Luan; Yuanchun Li; Yunxin Liu; Bin Wang; Bo An

arXiv:2505.11891 · cs.CL · February 3, 2026

TL;DR

Mobile-Bench-v2 is a benchmark for evaluating VLM-based mobile agents across several challenging scenarios. It targets limitations of earlier benchmarks: unstable reward signals, single-path evaluation of multi-solution tasks, and no assessment of noise robustness or proactive interaction.

Contribution

The paper presents Mobile-Bench-v2, a new benchmark with diverse splits and evaluation methods to better assess mobile agents' capabilities in realistic environments.

Findings

01. Mobile-Bench-v2 enables multi-path offline evaluation.

02. It includes noisy and ambiguous-instruction splits.

03. Evaluations show that agent performance varies across scenarios.

Abstract

VLM-based mobile agents are increasingly popular because they can interact with smartphone GUIs and XML-structured text to complete daily tasks. However, existing online benchmarks struggle to obtain stable reward signals because their environments change dynamically. Offline benchmarks evaluate agents on single-path trajectories, which conflicts with the inherently multi-solution nature of GUI tasks. Additionally, neither type of benchmark can assess whether mobile agents handle noise or engage in proactive interaction, because the evaluation environments lack noisy apps and the instructions are overly complete. To address these limitations, we use a slot-based instruction generation method to construct a more realistic and comprehensive benchmark named Mobile-Bench-v2. Mobile-Bench-v2 includes a common task split, with offline multi-path…

Taxonomy

Topics: Mobile Agent-Based Network Management · Multi-Agent Systems and Negotiation · Advanced Software Engineering Methodologies