SWE-Bench Mobile: Can Large Language Model Agents Develop Industry-Level Mobile Applications?

Muxin Tian; Zhe Wang; Blair Yang; Zhenwei Tang; Kunlun Zhu; Honghua Dong; Hanchen Li; Xinni Xie; Guangjing Wang; Jiaxuan You

arXiv:2602.09540 · cs.SE · February 11, 2026

Open Access

TL;DR

SWE-Bench Mobile is a benchmark for evaluating LLM coding agents on realistic software engineering tasks derived from a production iOS codebase. Across 22 agent-model configurations, even the best achieve only a 12% task success rate.

Contribution

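The paper introduces SWE-Bench Mobile, a benchmark that evaluates LLM agents against the complexity of industrial mobile app development: multi-modal inputs (PRDs and Figma designs), a large-scale mixed Swift/Objective-C codebase, and comprehensive test suites.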

Findings

01

Best agent achieves only 12% success rate

02

Agent design significantly affects performance: the same model shows up to a 6× gap across agents

03

Commercial agents outperform open-source counterparts

Abstract

Can large language model agents develop industry-level mobile applications? We introduce SWE-Bench Mobile, a benchmark for evaluating coding agents on realistic software engineering tasks derived from a production iOS codebase. Unlike existing benchmarks that focus on isolated problems or bug fixes, SWE-Bench Mobile captures the full complexity of industrial development: multi-modal inputs (PRDs and Figma designs), a large-scale mixed Swift/Objective-C codebase, and comprehensive test suites. We evaluate 22 agent-model configurations across four coding agents, three commercial (Cursor, Codex, Claude Code) and one open-source (OpenCode), and find that even the best configurations achieve only a 12% task success rate. Our analysis reveals that (1) agent design matters as much as model capability: the same model shows up to a 6× performance gap across agents, (2)…
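
The two headline metrics in the abstract, task success rate and the performance gap of one model across agents, can be illustrated with a minimal sketch. It assumes a task counts as solved when a single pass/fail flag is true; the paper's actual scoring procedure is not described in this excerpt. The agent and model names and the outcome lists are made-up placeholders, not results from the paper.

```python
from collections import defaultdict
from fractions import Fraction


def success_rate(outcomes):
    """Exact fraction of tasks marked as solved (True)."""
    if not outcomes:
        return Fraction(0)
    return Fraction(sum(outcomes), len(outcomes))


def agent_gap_per_model(results):
    """For each model, the ratio of its best to its worst success rate across agents.

    `results` maps (agent, model) to a list of per-task booleans.
    Returns None for a model whose worst rate is zero (ratio undefined).
    """
    by_model = defaultdict(list)
    for (_agent, model), outcomes in results.items():
        by_model[model].append(success_rate(outcomes))

    gaps = {}
    for model, rates in by_model.items():
        worst, best = min(rates), max(rates)
        gaps[model] = best / worst if worst > 0 else None
    return gaps


# Hypothetical outcomes for illustration only (not data from the paper).
toy_results = {
    ("agent_a", "model_x"): [True] * 3 + [False] * 17,  # 3 of 20 tasks solved
    ("agent_b", "model_x"): [True] * 1 + [False] * 19,  # 1 of 20 tasks solved
}

toy_gaps = agent_gap_per_model(toy_results)
```

Using `Fraction` keeps the ratio exact, so a gap of "3×" in the toy data is not blurred by floating-point rounding; convert with `float(...)` only for display.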

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Software Engineering Techniques and Practices · Artificial Intelligence in Law