TL;DR
UI-Copilot introduces a collaborative GUI agent framework with tool-optimized policy training, significantly improving long-horizon task performance and generalization in complex user interface interactions.
Contribution
The paper proposes UI-Copilot, a framework that pairs a GUI agent with a lightweight copilot for memory retrieval and numerical computation, and introduces Tool-Integrated Policy Optimization (TIPO) for learning effective tool invocation.
Findings
UI-Copilot-7B achieves state-of-the-art results on MemGUI-Bench.
UI-Copilot-7B outperforms other 7B-scale GUI agents like GUI-Owl-7B.
UI-Copilot-7B improves AndroidWorld performance by 17.1%.
Abstract
MLLM-based GUI agents have demonstrated strong capabilities in complex user interface interaction tasks. However, long-horizon scenarios remain challenging, as these agents are burdened with tasks beyond their intrinsic capabilities, suffering from memory degradation, progress confusion, and math hallucination. To address these challenges, we present UI-Copilot, a collaborative framework where the GUI agent focuses on task execution while a lightweight copilot provides on-demand assistance for memory retrieval and numerical computation. We introduce memory decoupling to separate persistent observations from transient execution context, and train the policy agent to selectively invoke the copilot as Retriever or Calculator based on task demands. To enable effective tool invocation learning, we propose Tool-Integrated Policy Optimization (TIPO), which separately optimizes tool selection…
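To make the collaboration pattern in the abstract concrete, here is a minimal Python sketch of a copilot that holds persistent observations apart from the agent's execution context and can be invoked either as a Retriever or as a Calculator. Every name in it (`Copilot`, `record`, `retrieve`, `calculate`, `invoke`, `safe_eval`) is hypothetical and chosen for illustration; the keyword match stands in for whatever model-based retrieval the paper actually uses, and TIPO training is not represented at all.

```python
import ast
import operator
from dataclasses import dataclass, field

# Hypothetical sketch: none of these names come from the paper or its code.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_eval(expr: str) -> float:
    """Evaluate +, -, *, / arithmetic by walking the AST, without eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")

    return walk(ast.parse(expr, mode="eval").body)


@dataclass
class Copilot:
    # Persistent observations, stored separately from the agent's
    # transient execution context (the "memory decoupling" idea).
    notes: list[str] = field(default_factory=list)

    def record(self, observation: str) -> None:
        self.notes.append(observation)

    def retrieve(self, query: str) -> list[str]:
        # Toy keyword match; an assumption standing in for real retrieval.
        words = query.lower().split()
        return [n for n in self.notes if any(w in n.lower() for w in words)]

    def calculate(self, expr: str) -> float:
        return safe_eval(expr)


def invoke(copilot: Copilot, tool: str, argument: str):
    """Dispatch an agent-issued tool call to the matching copilot role."""
    if tool == "retriever":
        return copilot.retrieve(argument)
    if tool == "calculator":
        return copilot.calculate(argument)
    raise ValueError(f"unknown tool: {tool!r}")


copilot = Copilot()
copilot.record("Coffee costs 4.50 at Cafe A")
copilot.record("Tea costs 3.00 at Cafe B")
recalled = invoke(copilot, "retriever", "coffee")
total = invoke(copilot, "calculator", "4.50 + 3.00 * 2")
```

In this sketch the agent never does arithmetic or long-range recall itself: it only decides which role to call and with what argument, which is the decision the abstract says the policy is trained to make selectively.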
