OpenPhone: Mobile Agentic Foundation Models
Yangqin Jiang, Chao Huang

TL;DR
OpenPhone is a mobile GUI agent system that combines on-device models with cloud collaboration, achieving high performance and cost efficiency for mobile platforms through a novel training and interaction management approach.
Contribution
It introduces a hybrid device-cloud framework with specialized training and real-time complexity assessment to optimize mobile GUI agent performance and reduce cloud costs.
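As a rough illustration of the device-cloud idea, the sketch below routes each step to either an on-device agent or a cloud agent based on a complexity score. It is not the paper's implementation: `route_step`, `Step`, the scoring function, and the threshold value are all hypothetical names and choices made here for illustration, and the summary does not say how OpenPhone actually assesses complexity.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    action: str       # action chosen for this GUI step
    handled_by: str   # "device" or "cloud"


def route_step(
    observation: str,
    complexity_fn: Callable[[str], float],   # hypothetical scorer in [0, 1]
    device_agent: Callable[[str], str],      # stand-in for the small on-device model
    cloud_agent: Callable[[str], str],       # stand-in for the larger cloud model
    threshold: float = 0.5,                  # assumed cutoff, not from the paper
) -> Step:
    """Send the step to the cloud only when it scores above the threshold."""
    score = complexity_fn(observation)
    if score > threshold:
        return Step(cloud_agent(observation), "cloud")
    return Step(device_agent(observation), "device")


def cloud_fraction(steps: list[Step]) -> float:
    """Share of steps that needed a cloud call (a simple proxy for cloud cost)."""
    if not steps:
        return 0.0
    return sum(s.handled_by == "cloud" for s in steps) / len(steps)
```

Under these assumptions, lowering the threshold trades higher cloud cost for more frequent use of the stronger model, and `cloud_fraction` gives a simple way to track that trade-off over an episode.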
Findings
Matches or approaches the performance of larger models
Significantly reduces cloud costs
Effective on the AndroidLab benchmark
Abstract
With the advancement of multimodal large language models (MLLMs), building GUI agent systems has become an increasingly promising direction, especially for mobile platforms, given their rich app ecosystems and intuitive touch interactions. Yet mobile GUI agents face a critical dilemma: truly on-device models (4B or smaller) lack sufficient performance, while capable models (starting from 7B) are either too large for mobile deployment or prohibitively costly (e.g., cloud-only closed-source MLLMs). To resolve this, we propose OpenPhone, a mobile GUI agent system that leverages device-cloud collaboration to tap the cost-efficiency of on-device models and the high capability of cloud models, while avoiding their drawbacks. Specifically, OpenPhone enhances Qwen2.5-VL-3B via two-stage SFT->GRPO training on synthetic GUI data for strong decision-making, integrates an efficient long-reasoning…
