Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Scheduling System
Yuan Guo, Tingjia Miao, Zheng Wu, Pengzhou Cheng, Ming Zhou, Zhuosheng Zhang

TL;DR
This paper introduces UI-NEXUS, a new benchmark for evaluating compositional generalization in mobile agents, and proposes AGENT-NEXUS, a scheduling system that improves task success rates by decomposing complex tasks into atomic subtasks.
Contribution
The paper presents UI-NEXUS, a comprehensive benchmark for compositional tasks, and introduces AGENT-NEXUS, a scheduling system that enhances existing mobile agents' performance on these tasks.
Findings
Existing agents struggle with compositional tasks, showing failure modes like under- and over-execution.
UI-NEXUS proves challenging for existing agents, revealing a generalization gap between atomic and compositional tasks.
AGENT-NEXUS improves task success rates by 24-40% on compositional tasks.
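The idea of decomposing a compositional task into atomic subtasks can be pictured with a small sketch. This is not the AGENT-NEXUS implementation, whose design is not described in this summary. It is a minimal illustration, assuming a plan is an ordered list of subtasks in which a later subtask may reuse the result of an earlier one. The names Subtask, run_plan, and echo_agent, as well as the example instructions, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Subtask:
    """One atomic instruction plus the names of earlier results it needs."""
    name: str
    instruction: str
    needs: List[str] = field(default_factory=list)


# Hypothetical executor type: takes an instruction string, returns a result string.
Executor = Callable[[str], str]


def run_plan(plan: List[Subtask], execute: Executor) -> Dict[str, str]:
    """Run subtasks in order, filling {placeholders} with earlier results."""
    results: Dict[str, str] = {}
    for task in plan:
        # Refuse to run a subtask whose inputs have not been produced yet.
        missing = [n for n in task.needs if n not in results]
        if missing:
            raise ValueError(f"{task.name} needs unfinished subtasks: {missing}")
        context = {n: results[n] for n in task.needs}
        results[task.name] = execute(task.instruction.format(**context))
    return results


def echo_agent(instruction: str) -> str:
    """Stand-in for a real mobile agent (assumption): echoes its instruction."""
    return f"done({instruction})"


# Hypothetical two-step task: look something up, then reuse it in a message.
plan = [
    Subtask("lookup", "Find the meeting time in Calendar"),
    Subtask("send", "Message Alice: {lookup}", needs=["lookup"]),
]
results = run_plan(plan, echo_agent)
```

Running each subtask as a separate call with an explicit record of its result gives the caller a clear point at which one atomic step ends and the next begins, which is where the under- and over-execution failures noted above would surface.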
Abstract
Autonomous agents powered by multimodal large language models have been developed to facilitate task execution on mobile devices. However, prior work has predominantly focused on atomic tasks, such as short-chain execution tasks and single-screen grounding tasks, while overlooking the generalization to compositional tasks, which are indispensable for real-world applications. This work introduces UI-NEXUS, a comprehensive benchmark designed to evaluate mobile agents on three categories of compositional operations: Simple Concatenation, Context Transition, and Deep Dive. UI-NEXUS supports interactive evaluation in 20 fully controllable local utility app environments, as well as 30 online Chinese and English service apps. It comprises 100 interactive task templates with an average optimal step count of 14.05. Experimental results across a range of mobile agents with agentic workflow or…
Taxonomy
Topics: Big Data and Digital Economy · Ferroelectric and Negative Capacitance Devices · Software System Performance and Reliability
Methods: Softmax · Attention Is All You Need · travel james
