MVISU-Bench: Benchmarking Mobile Agents for Real-World Tasks by Multi-App, Vague, Interactive, Single-App and Unethical Instructions
Zeyu Huang, Juyuan Wang, Longfeng Chen, Boyi Xiao, Leng Cai, Yawen Zeng, Jin Xu

TL;DR
MVISU-Bench is a bilingual benchmark of 404 tasks across 137 mobile apps, designed to evaluate mobile agents on complex, real-world instructions. The paper also introduces Aider, a module that improves success rates and safety.
Contribution
The paper introduces MVISU-Bench, a new benchmark for mobile agents based on real user tasks, and proposes Aider, a plug-and-play module that enhances success and safety in mobile agent interactions.
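To make the "plug-and-play" idea concrete, here is a toy sketch of a module that sits in front of an agent and either refuses a risky instruction, asks the user to clarify a vague one, or passes the instruction through unchanged. It is not the authors' implementation: the names `GuardDecision`, `guard_check`, and `run_with_guard` are hypothetical, as are the keyword lists, which stand in for whatever model the paper actually uses.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical keyword lists for illustration only; a real module would
# rely on a learned model rather than substring matching.
RISKY_TERMS = ("steal", "harass", "stalk")
VAGUE_TERMS = ("something", "somewhere", "anything")


@dataclass
class GuardDecision:
    action: str  # "refuse", "ask_user", or "proceed"
    prompt: str  # text handed to the agent or shown to the user


def guard_check(instruction: str) -> GuardDecision:
    """Classify an instruction before it reaches the mobile agent."""
    text = instruction.lower()
    if any(term in text for term in RISKY_TERMS):
        return GuardDecision("refuse", "This request looks unsafe and was not executed.")
    if any(term in text for term in VAGUE_TERMS):
        return GuardDecision("ask_user", "Could you be more specific about what you want?")
    return GuardDecision("proceed", instruction)


def run_with_guard(instruction: str, agent: Callable[[str], str]) -> str:
    """Wrap any agent callable; the agent itself needs no changes."""
    decision = guard_check(instruction)
    if decision.action == "proceed":
        return agent(decision.prompt)
    # For "refuse" and "ask_user", return the message instead of acting.
    return decision.prompt
```

Because the wrapper only needs a callable that takes an instruction string, it can be placed in front of different agent frameworks without modifying them, which is the sense of "plug-and-play" assumed here.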
Findings
Aider improves success rates by 19.55% on MVISU-Bench.
Aider raises the success rate on unethical instructions by 53.52%.
The benchmark reveals gaps between current mobile agents and user expectations.
Abstract
Given the significant advances of Large Vision Language Models (LVLMs) in reasoning and visual understanding, mobile agents are rapidly emerging to meet users' automation needs. However, existing evaluation benchmarks are disconnected from the real world and fail to adequately address the diverse and complex requirements of users. From our extensive collection of user questionnaires, we identified five task types: Multi-App, Vague, Interactive, Single-App, and Unethical Instructions. Around these tasks, we present MVISU-Bench, a bilingual benchmark that includes 404 tasks across 137 mobile applications. Furthermore, we propose Aider, a plug-and-play module that acts as a dynamic prompt prompter to mitigate risks and clarify user intent for mobile agents. Aider is easy to integrate into several frameworks and has improved overall success rates by 19.55% compared to…
