OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents
Reyna Abhyankar, Qi Qi, Yiying Zhang

TL;DR
This paper analyzes the temporal efficiency of computer-use AI agents on OSWorld, shows that end-to-end latency can reach tens of minutes on tasks that take humans a few minutes, and provides a benchmark for measuring agent efficiency.
Contribution
It presents the first study on the temporal performance of computer-use agents, introduces OSWorld-Human for detailed evaluation, and benchmarks the efficiency of 16 agents.
Findings
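Large model calls for planning, reflection, and judging account for most of the overall latency.
Steps get slower as a task progresses: a later step can take 3x longer than a step at the beginning of the task.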
Even top agents use substantially more steps than necessary.
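The last two findings can be illustrated with a minimal sketch. It assumes a per-step wall-clock trace and a reference step count are available for each task. The function names, the window size, and the sample numbers are hypothetical and are not taken from the paper, which may define its efficiency metrics differently.

```python
from statistics import mean


def late_to_early_ratio(step_seconds, window=3):
    """Mean duration of the last `window` steps divided by the mean of the first `window`."""
    if len(step_seconds) < 2 * window:
        raise ValueError("need at least 2 * window steps")
    return mean(step_seconds[-window:]) / mean(step_seconds[:window])


def step_overhead(agent_steps, reference_steps):
    """How many times more steps the agent took than a reference trajectory."""
    if reference_steps <= 0:
        raise ValueError("reference_steps must be positive")
    return agent_steps / reference_steps


# Hypothetical per-step wall-clock times in seconds for one task (not data from the paper).
trace = [10.0, 12.0, 14.0, 20.0, 26.0, 30.0, 36.0, 42.0]

# Hypothetical reference step count, e.g. from a human-annotated trajectory.
reference_steps = 4

print(late_to_early_ratio(trace))
print(step_overhead(len(trace), reference_steps))
```

A ratio well above 1 from the first function indicates that steps are slowing down over the course of a task, and a value above 1 from the second indicates the agent took more steps than the reference trajectory.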
Abstract
Generative AI is being leveraged to solve a variety of computer-use tasks involving desktop applications. State-of-the-art systems have focused solely on improving accuracy on leading benchmarks. However, these systems are practically unusable due to extremely high end-to-end latency (e.g., tens of minutes) for tasks that typically take humans just a few minutes to complete. To understand the cause behind this and to guide future developments of computer agents, we conduct the first study on the temporal performance of computer-use agents on OSWorld, the flagship benchmark in computer-use AI. We find that large model calls for planning, reflection, and judging account for most of the overall latency, and as an agent uses more steps to complete a task, each successive step can take 3x longer than steps at the beginning of a task. We then construct OSWorld-Human, a manually annotated…
Taxonomy
Topics: Personal Information Management and User Behavior · Advanced Software Engineering Methodologies · Artificial Intelligence in Games
