MVISU-Bench: Benchmarking Mobile Agents for Real-World Tasks by Multi-App, Vague, Interactive, Single-App and Unethical Instructions

Zeyu Huang; Juyuan Wang; Longfeng Chen; Boyi Xiao; Leng Cai; Yawen Zeng; Jin Xu

arXiv:2508.09057 · cs.CL · August 18, 2025

TL;DR

MVISU-Bench is a bilingual benchmark of 404 tasks across 137 mobile apps, designed to evaluate mobile agents on complex, real-world instructions. The paper also introduces Aider, a module that improves success rates and safety.

Contribution

The paper introduces MVISU-Bench, a new benchmark for mobile agents based on real user tasks, and proposes Aider, a plug-and-play module that enhances success and safety in mobile agent interactions.
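To make the "plug-and-play" idea concrete, here is a minimal sketch of a screening layer wrapped around an existing agent. It is an illustration only, not the authors' Aider: the keyword lists, the `Decision` and `Guidance` types, and the functions `screen_instruction` and `with_screening` are all hypothetical, and a real module would presumably rely on a model rather than keyword matching.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Decision(Enum):
    PROCEED = "proceed"    # instruction looks safe and specific enough
    ASK_USER = "ask_user"  # instruction is vague; request clarification
    REFUSE = "refuse"      # instruction looks unethical; do not act


@dataclass
class Guidance:
    decision: Decision
    prompt: str  # text passed to the agent, or returned to the user


# Hypothetical keyword lists, used only to keep the sketch self-contained.
RISKY_TERMS = {"steal", "harass", "stalk"}
VAGUE_TERMS = {"something", "somewhere", "whatever"}


def screen_instruction(instruction: str) -> Guidance:
    """Classify an instruction before it reaches the mobile agent."""
    words = set(re.findall(r"[a-z']+", instruction.lower()))
    if words & RISKY_TERMS:
        return Guidance(Decision.REFUSE, "Declined: the request appears harmful.")
    if words & VAGUE_TERMS:
        return Guidance(Decision.ASK_USER, "Please clarify what you want before I act.")
    return Guidance(Decision.PROCEED, instruction)


def with_screening(agent: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap any agent callable so every instruction is screened first."""
    def wrapped(instruction: str) -> str:
        guidance = screen_instruction(instruction)
        if guidance.decision is Decision.PROCEED:
            return agent(guidance.prompt)
        # For vague or risky input, answer the user without invoking the agent.
        return guidance.prompt
    return wrapped
```

Because the wrapper only assumes the agent is a callable from instruction text to a result, it can sit in front of different agent frameworks without changing them, which is the sense of "plug-and-play" assumed here.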

Findings

01

Aider improves success rates by 19.55% on MVISU-Bench.

02

Aider improves performance on unethical instructions by 53.52%.

03

The benchmark reveals gaps between current mobile agents and user expectations.

Abstract

Given the significant advances in Large Vision Language Models (LVLMs) in reasoning and visual understanding, mobile agents are rapidly emerging to meet users' automation needs. However, existing evaluation benchmarks are disconnected from the real world and fail to adequately address the diverse and complex requirements of users. From our extensive collection of user questionnaires, we identified five types of tasks: Multi-App, Vague, Interactive, Single-App, and Unethical Instructions. Around these tasks, we present MVISU-Bench, a bilingual benchmark that includes 404 tasks across 137 mobile applications. Furthermore, we propose Aider, a plug-and-play module that acts as a dynamic prompt prompter to mitigate risks and clarify user intent for mobile agents. Aider is easy to integrate into several frameworks and has improved overall success rates by 19.55% compared to…
