UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action

Yuhao Yang; Zhen Yang; Zi-Yi Dou; Anh Nguyen; Keen You; Omar Attia; Andrew Szot; Michael Feng; Ram Ramrakhya; Alexander Toshev; Chao Huang; Yinfei Yang; Zhe Gan

arXiv:2510.17790 · cs.CV · December 12, 2025

Open Access

TL;DR

UltraCUA introduces a hybrid action foundation model that combines primitive GUI operations with high-level tool execution, significantly improving the robustness and efficiency of computer-use agents across diverse tasks.

Contribution

The paper presents a hybrid action framework, a scalable data pipeline, and a two-stage training process that together let agents combine GUI primitives with tool-based actions.

Findings

01. 22% performance improvement on OSWorld

02. 11% faster execution compared to existing methods

03. 21.7% success rate on WindowsAgentArena

Abstract

Computer-use agents face a fundamental limitation. They rely exclusively on primitive GUI actions (click, type, scroll), creating brittle execution chains prone to cascading failures. While API-driven agents harness rich capabilities through structured interfaces and tools, computer-use agents remain constrained to low-level visual interactions. We present UltraCUA, a foundation model that transcends this limitation through hybrid action, seamlessly unifying primitive GUI operations with high-level tool execution. Our innovation rests on four critical advances. First, an automated pipeline extracts and scales tool capabilities from software documentation and code repositories. Second, a synthetic data engine produces 17,000+ verifiable tasks capturing real-world computer-use complexity. Third, comprehensive hybrid action trajectory collection incorporates both GUI primitives and…
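To make the idea of a hybrid action space concrete, the following is a minimal sketch, not the authors' implementation. It assumes an agent emits either a primitive GUI action or a named tool call and that a single dispatcher routes both. The names `GuiAction`, `ToolCall`, `execute`, and the `set_cell` tool are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Union

# Hypothetical set of primitive GUI actions, taken from the abstract's examples.
GUI_PRIMITIVES = {"click", "type", "scroll"}


@dataclass(frozen=True)
class GuiAction:
    """A low-level GUI primitive such as a click at screen coordinates."""
    kind: str
    args: Dict[str, Any]


@dataclass(frozen=True)
class ToolCall:
    """A high-level tool invocation identified by name (hypothetical schema)."""
    name: str
    args: Dict[str, Any]


Action = Union[GuiAction, ToolCall]


def execute(action: Action,
            tools: Dict[str, Callable[..., Any]],
            gui_log: List[GuiAction]) -> Any:
    """Route one action: tool calls go to the registry, GUI primitives are logged.

    A real agent would drive the screen for GUI primitives; here they are only
    appended to gui_log so the sketch stays self-contained.
    """
    if isinstance(action, ToolCall):
        if action.name not in tools:
            raise KeyError(f"unknown tool: {action.name}")
        return tools[action.name](**action.args)
    if action.kind not in GUI_PRIMITIVES:
        raise ValueError(f"unknown GUI primitive: {action.kind}")
    gui_log.append(action)
    return f"gui:{action.kind}"


if __name__ == "__main__":
    log: List[GuiAction] = []
    # Hypothetical tool that replaces a multi-step click-and-type sequence.
    registry = {"set_cell": lambda sheet, cell, value: f"{sheet}!{cell}={value}"}
    print(execute(ToolCall("set_cell", {"sheet": "S1", "cell": "A1", "value": 3}), registry, log))
    print(execute(GuiAction("click", {"x": 10, "y": 20}), registry, log))
```

In this sketch a single tool call stands in for what would otherwise be several chained GUI primitives, which is the kind of shortening of brittle execution chains the abstract describes.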

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Software Engineering Methodologies · Reinforcement Learning in Robotics