Grounding Computer Use Agents on Human Demonstrations
Aarash Feizi, Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Kaixin Li, Rabiul Awal, Xing Han Lù, Johan Obando-Ceron, Juan A. Rodriguez, Nicolas Chapados, David Vazquez, Adriana Romero-Soriano, Reihaneh Rabbany, Perouz Taslakian, Christopher Pal, Spandana Gella, Sai Rajeswar

TL;DR
This paper introduces GroundCUA, a large-scale desktop grounding dataset with expert annotations, enabling the development of models that accurately connect natural language instructions to on-screen elements, advancing computer-use agents.
Contribution
The paper presents GroundCUA, a comprehensive desktop grounding dataset with expert annotations, and the GroundNext models that achieve state-of-the-art performance with less training data.
Findings
GroundNext models outperform prior models on desktop grounding benchmarks.
Reinforcement learning enhances model performance.
High-quality datasets are crucial for general-purpose agents.
Abstract
Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen elements. While large datasets exist for web and mobile interactions, high-quality resources for desktop environments are limited. To address this gap, we introduce GroundCUA, a large-scale desktop grounding dataset built from expert human demonstrations. It covers 87 applications across 12 categories and includes 56K screenshots, with every on-screen element carefully annotated for a total of over 3.56M human-verified annotations. From these demonstrations, we generate diverse instructions that capture a wide range of real-world tasks, providing high-quality data for model training. Using GroundCUA, we develop the GroundNext family of models that map instructions to their target UI elements. At both 3B and 7B scales, GroundNext achieves state-of-the-art…
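To make the grounding task concrete, here is a minimal sketch under one stated assumption: a prediction counts as correct when the predicted click point falls inside the annotated bounding box of the target element. This summary does not confirm that the paper scores grounding this way. The `Element` type, its field names, and the sample data are hypothetical. The last line derives the average annotation density from the abstract's rounded figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Hypothetical annotated UI element: a label plus a pixel bounding box."""
    label: str
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x: float, y: float) -> bool:
        """Return True if the point (x, y) lies inside the box, edges included."""
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(predictions, targets):
    """Fraction of predicted (x, y) points that fall inside their target box.

    Assumption: point-in-box scoring, one prediction per target element.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(t.contains(x, y) for (x, y), t in zip(predictions, targets))
    return hits / len(targets)


# Hypothetical sample: the first click lands on its target, the second misses.
save = Element("Save button", 10, 10, 42, 30)
bold = Element("Bold icon", 100, 12, 116, 28)
accuracy = grounding_accuracy([(20, 20), (130, 20)], [save, bold])

# Average annotation density implied by the abstract's rounded figures
# (over 3.56M annotations across 56K screenshots).
elements_per_screenshot = 3_560_000 / 56_000
```

On the sample data the accuracy is 0.5. The density works out to roughly 64 elements per screenshot, which matches the per-screen average quoted in the reviews.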
Peer Reviews
Decision: ICLR 2026 Poster
1. A new SOTA performance on grounding tasks: with only one-tenth of the training data compared to previous methods, GROUNDNEXT achieves state-of-the-art performance on desktop benchmarks while also generalizing to OOD categories, demonstrating the effectiveness and huge contribution of this grounding dataset. 2. Action and task types that match real-world computer tasks: demonstrations are collected from real-world user experiences, and the applications span diverse real-world use cases, helpin
1. Although the authors demonstrate their dataset's effectiveness on the grounding task, they did not show whether this grounding ability transfers to improved performance on computer-use tasks. 2. After the model underwent reinforcement learning on 10K samples, the improvement was limited, casting doubt on the authors' motivation for using the reinforcement learning step.
GROUNDCUA fills a major gap in desktop grounding data. Most existing datasets focus on web or mobile interfaces, but desktop apps are way more complex with tiny icons and dense layouts. The human expert approach is smart - instead of automated scraping, they had people actually use the software and annotate everything they see. The results show that data quality beats data quantity. The two-stage approach with supervised fine-tuning plus reinforcement learning is pretty straightforward. No comp
The paper describes three instruction types but doesn't analyze how they affect model performance differently. The instruction generation relies heavily on prompting Qwen2.5-VL-72B. But there's no discussion of prompt sensitivity or failure cases. What happens when the LLM generates wrong instructions? The paper mentions using "about 100 templates" for textual elements and "120 templates" for general ones. But it doesn't say which templates work best or how template diversity impacts training.
GROUNDCUA features high-quality, human-verified supervision data. The elements are hand-labeled from expert task demonstrations, giving reliable targets rather than noisy accessibility or synthetic signals. The dataset offers dense, fine-grained coverage, with screens averaging 64 labeled elements to support precise grounding at pixel-level granularity. The SFT model demonstrates strong performance with modest data volume. GROUNDNEXT tops SFT baselines across five benchmark
As this is a dataset paper, it would be great if the authors could provide a link to the dataset.
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Explainable Artificial Intelligence (XAI)
