MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents

Youngmin Im; Byeongung Jo; Jaeyoung Wi; Seungwoo Baek; Tae Hoon Min; Joo Hyung Lee; Sangeun Oh; Insik Shin; Sunjae Lee

arXiv:2512.12634·cs.AI·May 14, 2026

TL;DR

MobiBench is a modular, multi-path-aware offline benchmarking framework for mobile GUI agents. It targets high fidelity (94.72% agreement with human evaluators), scalability, and reproducibility, and it supports module-level analysis of agent components.

Contribution

The paper introduces what it describes as the first multi-path-aware, modular offline benchmark for mobile GUI agents, addressing the limitations of existing single-path and monolithic evaluation methods.

Findings

1. MobiBench achieves 94.72% agreement with human evaluators.

2. It enables detailed module-level analysis of mobile GUI agents.

3. The framework uncovers key insights into component contributions and limitations.

Abstract

Mobile GUI agents, AI agents capable of interacting with mobile applications on behalf of users, have the potential to transform human-computer interaction. However, current evaluation practices for GUI agents face two fundamental limitations. First, they rely on either single-path offline benchmarks or online live benchmarks. Offline benchmarks that use static, single-path annotated datasets unfairly penalize valid alternative actions, while online benchmarks suffer from poor scalability and reproducibility due to the dynamic and unpredictable nature of live evaluation. Second, existing benchmarks treat agents as monolithic black boxes, overlooking the contributions of individual components, which often leads to unfair comparisons or obscures key performance bottlenecks. To address these limitations, we present MobiBench, the first modular and multi-path-aware offline benchmarking…
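The difference between single-path and multi-path offline scoring can be illustrated with a minimal sketch. This is not MobiBench's implementation: the `Action` type, its fields, the element identifiers, and both scoring functions are hypothetical names chosen only to show why a single annotated path can penalize a valid alternative action.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical representation of one GUI step (not the paper's schema)."""
    kind: str        # e.g. "tap", "type", "scroll"
    target: str      # hypothetical UI element identifier
    text: str = ""   # typed text, if any


def single_path_correct(predicted: Action, annotated: Action) -> bool:
    # Single-path scoring: only the one annotated action counts as correct.
    return predicted == annotated


def multi_path_correct(predicted: Action, valid_actions: frozenset) -> bool:
    # Multi-path scoring: any action in the set of valid alternatives counts.
    return predicted in valid_actions


# Hypothetical step: a settings page can be reached via a menu or a search bar.
annotated = Action("tap", "network_menu")
valid = frozenset({annotated, Action("tap", "search_bar")})
predicted = Action("tap", "search_bar")

print(single_path_correct(predicted, annotated))  # False
print(multi_path_correct(predicted, valid))       # True
```

In this toy case the agent's action is a valid alternative, yet the single-path check marks it wrong, while the multi-path check accepts it. How MobiBench actually represents and matches alternative actions is not described in the text above.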
