Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

Miaosen Zhang; Xiaohan Zhao; Zhihong Tan; Zhou Huoshen; Yijia Fan; Yifan Yang; Kai Qiu; Bei Liu; Justin Wagle; Chenzhong Yin; Mingxi Cheng; Ji Li; Qi Dai; Chong Luo; Xu Yang; Xin Geng; Baining Guo

arXiv:2605.12501 · cs.CV · May 13, 2026

TL;DR

This paper introduces CUActSpot, a benchmark for evaluating how well computer-use agents handle complex, diverse GUI interactions across five modalities, together with a data synthesis pipeline for improving them.

Contribution

It contributes the CUActSpot benchmark and a renderer-based data synthesis pipeline that supplies training data for complex GUI interactions beyond prior click-centric datasets.

Findings

01 Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters.

02 The benchmark covers five modalities (GUI, text, table, canvas, and natural image) and a variety of actions (click, drag, draw, etc.), going beyond prior click-centric benchmarks.

03 The data synthesis pipeline automatically generates scenes and instructions for training.
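To make the second finding concrete, here is a minimal sketch of how a benchmark sample spanning the five modalities and the click, drag, and draw actions could be represented. Every name below (Modality, Action, Sample, coverage, and the helper constructors) is hypothetical and chosen for illustration; the summary does not describe CUActSpot's actual data schema or scoring, so this should not be read as the authors' format.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in screen pixels; an assumed convention


class Modality(Enum):
    """The five modalities named in the abstract."""
    GUI = "gui"
    TEXT = "text"
    TABLE = "table"
    CANVAS = "canvas"
    NATURAL_IMAGE = "natural_image"


@dataclass(frozen=True)
class Action:
    """Hypothetical action record: a kind plus the points it touches."""
    kind: str                  # "click", "drag", or "draw"
    points: Tuple[Point, ...]  # 1 point for click, 2 for drag, >= 2 for draw


def click(x: float, y: float) -> Action:
    return Action("click", ((x, y),))


def drag(start: Point, end: Point) -> Action:
    return Action("drag", (start, end))


def draw(path: Sequence[Point]) -> Action:
    # A freehand stroke needs at least a start and an end point.
    if len(path) < 2:
        raise ValueError("a draw action needs at least two points")
    return Action("draw", tuple(path))


@dataclass(frozen=True)
class Sample:
    """Hypothetical benchmark item: instruction plus a reference action."""
    modality: Modality
    instruction: str
    action: Action


def coverage(samples: Iterable[Sample]) -> Counter:
    """Count samples per (modality, action kind) pair."""
    return Counter((s.modality, s.action.kind) for s in samples)


if __name__ == "__main__":
    demo = [
        Sample(Modality.GUI, "Press the Save button", click(10, 20)),
        Sample(Modality.TABLE, "Select the first column", drag((0, 0), (0, 50))),
        Sample(Modality.CANVAS, "Draw a zigzag", draw([(0, 0), (5, 5), (10, 0)])),
    ]
    for (modality, kind), n in coverage(demo).items():
        print(modality.value, kind, n)
```

A table like the one returned by coverage is one simple way to check whether a dataset is click-centric: most of the mass would sit on click entries for the GUI modality, with few or no drag and draw entries elsewhere.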

Abstract

Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations, where a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of data for complex interactions. To address this problem, we propose a new benchmark CUActSpot for evaluating models' capabilities on complex interactions across five modalities: GUI, text, table, canvas, and natural image, as well as a variety of actions (click, drag, draw, etc.), covering a broader range of interaction types than prior click-centric benchmarks that focus mainly on GUI widgets. We also design a…

Code & Models

Repositories

microsoft/Phi-Ground.git (GitHub)

Models

microsoft/Phi-Ground-Any (Hugging Face model, 270 downloads, 15 likes)
