SWE-Bench Mobile: Can Large Language Model Agents Develop Industry-Level Mobile Applications?
Muxin Tian, Zhe Wang, Blair Yang, Zhenwei Tang, Kunlun Zhu, Honghua Dong, Hanchen Li, Xinni Xie, Guangjing Wang, Jiaxuan You

TL;DR
SWE-Bench Mobile is a benchmark for evaluating large language model agents on realistic, industry-level mobile app development tasks derived from a production iOS codebase. Even the best of the 22 agent-model configurations tested reaches only a 12% task success rate.
Contribution
The paper introduces SWE-Bench Mobile, a new benchmark that captures the full complexity of industrial mobile app development for evaluating LLM agents.
Findings
Best agent achieves only 12% success rate
Agent design significantly impacts performance, up to 6× difference
Commercial agents outperform open-source counterparts
Abstract
Can large language model agents develop industry-level mobile applications? We introduce SWE-Bench Mobile, a benchmark for evaluating coding agents on realistic software engineering tasks derived from a production iOS codebase. Unlike existing benchmarks that focus on isolated problems or bug fixes, SWE-Bench Mobile captures the full complexity of industrial development: multi-modal inputs (PRDs and Figma designs), a large-scale mixed Swift/Objective-C codebase, and comprehensive test suites. We evaluate 22 agent-model configurations across four coding agents, three commercial (Cursor, Codex, Claude Code) and one open-source (OpenCode), and find that even the best configurations achieve only a 12% task success rate. Our analysis reveals that (1) agent design matters as much as model capability: the same model shows up to a 6× performance gap across agents, (2)…
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Software Engineering Techniques and Practices · Artificial Intelligence in Law
