OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents

Reyna Abhyankar; Qi Qi; Yiying Zhang

arXiv:2506.16042 · cs.AI · May 19, 2026

TL;DR

This paper analyzes the temporal efficiency of computer-use AI agents on OSWorld, shows that their end-to-end latency is far higher than the time humans need for the same tasks, and provides a benchmark for measuring agent efficiency.

Contribution

It presents the first study of the temporal performance of computer-use agents, introduces OSWorld-Human, a manually annotated benchmark for evaluating efficiency, and measures the efficiency of 16 agents.

Findings

01

Large model calls for planning, reflection, and judging account for most of the overall latency.

02

Steps get slower as a task progresses: each successive step can take 3x longer than steps at the beginning of the task.

03

Even top agents use substantially more steps than necessary.
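
The findings above can be made concrete with two simple ratios. The sketch below is illustrative only and is not the paper's evaluation code: the function names, the `window` parameter, the idea of a per-task reference step count, and the sample latencies are all assumptions introduced here.

```python
from statistics import mean


def late_to_early_latency_ratio(step_latencies, window=3):
    """Mean latency of the last `window` steps divided by the mean of the first `window`.

    Hypothetical helper: `window` is an arbitrary choice, not taken from the paper.
    """
    if len(step_latencies) < 2 * window:
        raise ValueError("need at least 2 * window steps")
    early = mean(step_latencies[:window])
    late = mean(step_latencies[-window:])
    return late / early


def step_overhead(agent_steps, reference_steps):
    """How many times more steps the agent took than a reference trajectory.

    Hypothetical helper: assumes a reference step count exists for the task.
    """
    if reference_steps <= 0:
        raise ValueError("reference_steps must be positive")
    return agent_steps / reference_steps


# Made-up per-step latencies in seconds, for illustration only.
latencies = [10.0, 12.0, 14.0, 20.0, 28.0, 34.0, 36.0, 38.0]

ratio = late_to_early_latency_ratio(latencies)
overhead = step_overhead(agent_steps=len(latencies), reference_steps=4)

print(f"late/early step latency ratio: {ratio:.1f}")
print(f"step overhead vs. reference:   {overhead:.1f}")
```

A ratio above 1 in the first metric means later steps are slower than early ones. A value above 1 in the second means the agent took more steps than the reference trajectory.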

Abstract

Generative AI is being leveraged to solve a variety of computer-use tasks involving desktop applications. State-of-the-art systems have focused solely on improving accuracy on leading benchmarks. However, these systems are practically unusable due to extremely high end-to-end latency (e.g., tens of minutes) for tasks that typically take humans just a few minutes to complete. To understand the cause behind this and to guide future developments of computer agents, we conduct the first study on the temporal performance of computer-use agents on OSWorld, the flagship benchmark in computer-use AI. We find that large model calls for planning, reflection, and judging account for most of the overall latency, and as an agent uses more steps to complete a task, each successive step can take 3x longer than steps at the beginning of a task. We then construct OSWorld-Human, a manually annotated…

Taxonomy

Topics: Personal Information Management and User Behavior · Advanced Software Engineering Methodologies · Artificial Intelligence in Games