MobileDev-Bench: A Benchmark for Issue Resolution in Mobile Application Development
Moshood A. Fakorede, Krishna Upadhyay, A.B. Siddique, Umar Farooq

TL;DR
MobileDev-Bench is a new benchmark for evaluating AI models on real-world mobile app issue resolution, highlighting the complexity and multi-file nature of mobile development fixes.
Contribution
The paper introduces a benchmark of 407 real-world issue-resolution tasks drawn from 19 production mobile applications. Each task pairs a developer-reported issue with executable test patches, so model-generated fixes can be validated automatically.
Findings
Evaluated LLMs resolve only 3.23%–5.69% of the benchmark's issues.
Fixes are large: they modify 12.9 files and 334.6 lines on average.
41% of issues require coordinated changes across multiple artifact types.
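To make the headline metrics concrete, here is a minimal sketch of how a resolution rate and average patch size could be computed from per-task results. It is not the authors' evaluation harness. The `TaskResult` type, its field names, and the rule that a task counts as resolved only when every test in its test patch passes are all assumptions for illustration, and the sample values are toy data rather than benchmark results.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    """Outcome of one hypothetical task (all fields are illustrative)."""
    files_changed: int   # files touched by the reference fix
    lines_changed: int   # lines added plus removed by the reference fix
    tests_passed: int    # tests from the task's test patch that passed
    tests_total: int     # tests in the task's test patch

    @property
    def resolved(self) -> bool:
        # Assumption: resolved means every test in the test patch passes.
        return self.tests_total > 0 and self.tests_passed == self.tests_total


def resolution_rate(results):
    """Percentage of tasks that count as resolved."""
    if not results:
        return 0.0
    return 100.0 * sum(r.resolved for r in results) / len(results)


def mean_patch_size(results):
    """Average (files changed, lines changed) per task."""
    return (mean(r.files_changed for r in results),
            mean(r.lines_changed for r in results))


# Toy data, not taken from the benchmark.
toy = [
    TaskResult(2, 40, 5, 5),
    TaskResult(10, 300, 3, 8),
    TaskResult(6, 140, 0, 4),
    TaskResult(22, 520, 7, 7),
]
print(resolution_rate(toy))
print(mean_patch_size(toy))
```

Treating partial test passes as unresolved keeps the metric strict; a harness that credits partial passes would report higher numbers for the same model outputs.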
Abstract
Large language models (LLMs) have shown strong performance on automated software engineering tasks, yet existing benchmarks focus primarily on library-style repositories, leaving mobile application development largely unexplored despite its framework-specific build systems, heterogeneous artifact types, and coordinated multi-file fix requirements. We introduce MobileDev-Bench, a benchmark comprising 407 real-world issue-resolution tasks collected from 19 production mobile applications spanning Android Native (Java/Kotlin), React Native (TypeScript), and Flutter (Dart). Each task pairs a verified developer-reported issue with executable test patches, enabling fully automated validation of model-generated fixes within mobile build environments. The benchmark exhibits substantially greater patch complexity than prior benchmarks: fixes modify 12.9 files and 334.6 lines on average, and 41%…
