MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments

Quyu Kong, Xu Zhang, Zhenyu Yang, Nolan Gao, Chen Liu, Panrong Tong, Chenglin Cai, Hanzhang Zhou, Jianan Zhang, Liangyu Chen, Zhidan Liu, Steven Hoi, and Yue Wang

arXiv:2512.19432 · cs.CL · January 1, 2026

TL;DR

MobileWorld is a new, more challenging benchmark for evaluating autonomous mobile agents, emphasizing real-world workflows, multi-application tasks, and user interactions, revealing significant performance gaps and research opportunities.

Contribution

The paper introduces MobileWorld, a comprehensive benchmark with diverse tasks and novel scenarios, addressing limitations of existing benchmarks like AndroidWorld.

Findings

01. MobileWorld tasks require nearly twice as many completion steps on average as AndroidWorld tasks (27.8 vs. 14.3).

02. Agents show a sharp performance drop on MobileWorld, with success rates below 52%.

03. The benchmark enables evaluation of user-aware, hybrid-tool scenarios.

Abstract

Among existing online mobile-use benchmarks, AndroidWorld has emerged as the dominant benchmark due to its reproducible environment and deterministic evaluation; however, recent agents achieving over 90% success rates indicate its saturation and motivate the need for a more challenging benchmark. In addition, its environment lacks key application categories, such as e-commerce and enterprise communication, and does not reflect realistic mobile-use scenarios characterized by vague user instructions and hybrid tool usage. We introduce MobileWorld, a substantially more challenging benchmark designed to reflect real-world usage through 201 tasks across 20 applications. MobileWorld derives its difficulty from an emphasis on long-horizon, cross-application workflows, requiring nearly twice as many completion steps on average (27.8 vs. 14.3) and featuring a significantly higher proportion of…

Code & Models

Datasets

Tongyi-MAI/MobileWorld
Dataset · 78 downloads

Taxonomy

Topics: Mobile Agent-Based Network Management · Advanced Software Engineering Methodologies · Multi-Agent Systems and Negotiation