CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents

Xiangru Jian; Shravan Nayak; Kevin Qinghong Lin; Aarash Feizi; Kaixin Li; Patrice Bechard; Spandana Gella; Sai Rajeswar

arXiv:2603.24440·cs.LG·March 26, 2026

TL;DR

CUA-Suite introduces a large-scale, high-quality video dataset of human desktop interactions, enabling significant advancements in training and evaluating computer-use agents for complex tasks.

Contribution

CUA-Suite provides the first extensive collection of continuous, expert-annotated desktop videos and related resources, addressing the scarcity of high-quality demonstration data for computer-use agents (CUAs).

Findings

01

Current models fail on about 60% of tasks in professional applications.

02

Continuous video captures the full interaction dynamics that sparse screenshots miss, making it a stronger basis for agent training.

03

Rich multimodal data supports new research directions in desktop automation.

Abstract

Computer-use agents (CUAs) hold great promise for automating complex desktop workflows, yet progress toward general-purpose agents is bottlenecked by the scarcity of continuous, high-quality human demonstration videos. Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents. However, the largest existing open dataset, ScaleCUA, contains only 2 million screenshots, equating to less than 20 hours of video. To address this bottleneck, we introduce CUA-Suite, a large-scale ecosystem of expert video demonstrations and dense annotations for professional desktop computer-use agents. At its core is VideoCUA, which provides approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations, totaling…
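The screenshot-to-video equivalence quoted in the abstract can be checked with a short calculation. The sketch below assumes the conversion uses the same 30 fps rate quoted for the VideoCUA recordings; the abstract does not say which frame rate underlies the "less than 20 hours" figure, and the helper name `frames_to_hours` is hypothetical.

```python
def frames_to_hours(num_frames: int, fps: float = 30.0) -> float:
    """Convert a frame (screenshot) count to hours of video at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    seconds = num_frames / fps
    return seconds / 3600.0


# 2 million screenshots, treated as consecutive frames at an assumed 30 fps.
scalecua_hours = frames_to_hours(2_000_000)
print(f"{scalecua_hours:.1f} hours")  # prints "18.5 hours"
```

Under that assumption the result is about 18.5 hours, which is consistent with the abstract's "less than 20 hours".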

Code & Models

Datasets

ServiceNow/VideoCUA
dataset · 1.3k downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Human Pose and Action Recognition