OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents

Mariya Davydova, Daniel Jeffries, Patrick Barker, Arturo Márquez Flores, Sinéad Ryan

arXiv:2505.03570·cs.AI·May 7, 2025

TL;DR

OSUniverse is a benchmark for evaluating multimodal GUI-navigation AI agents on complex desktop tasks. It offers automated validation and graded difficulty levels, so progress in agent capabilities can be measured over time.

Contribution

The paper introduces a new, extensible benchmark with automated validation for assessing how multimodal GUI-navigation AI agents perform on complex desktop tasks.

Findings

01. State-of-the-art agents (at the time of publication) score no higher than 50% on the benchmark tasks.

02. Automated validation has an average error rate of less than 2%.

03. The benchmark covers a range of task complexities, from basic precision clicking to multistep, multi-application tests.

Abstract

In this paper, we introduce OSUniverse: a benchmark of complex, multimodal desktop-oriented tasks for advanced GUI-navigation AI agents that focuses on ease of use, extensibility, comprehensive coverage of test cases, and automated validation. We divide the tasks into increasing levels of complexity, from basic precision clicking to multistep, multi-application tests requiring dexterity, precision, and clear thinking from the agent. In version one of the benchmark, presented here, we have calibrated the complexity of the benchmark test cases to ensure that the SOTA (State of the Art) agents (at the time of publication) do not achieve results higher than 50%, while the average white-collar worker can perform all these tasks with perfect accuracy. The benchmark can be scored manually, but we also introduce an automated validation mechanism that has an average error rate of less than 2%.…
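
The abstract does not say how the error rate of the automated validator is measured. One plausible reading, sketched below in Python as an assumption rather than the authors' method, is the fraction of tasks on which the automated verdict disagrees with the manual verdict. The function name `validator_error_rate` and the sample verdicts are hypothetical and are not taken from the paper or its repository.

```python
from typing import Sequence


def validator_error_rate(manual: Sequence[bool], automated: Sequence[bool]) -> float:
    """Fraction of tasks where the automated verdict differs from the manual one.

    Assumption: manual scoring is treated as ground truth.
    """
    if len(manual) != len(automated):
        raise ValueError("verdict lists must have the same length")
    if not manual:
        raise ValueError("at least one verdict is required")
    disagreements = sum(m != a for m, a in zip(manual, automated))
    return disagreements / len(manual)


# Made-up verdicts for illustration only (not data from the paper).
manual_verdicts = [True, False, True, True, False]
automated_verdicts = [True, False, True, False, False]
rate = validator_error_rate(manual_verdicts, automated_verdicts)
print(f"{rate:.0%}")
```

With these made-up verdicts the two scorers disagree on one task out of five. A reported average below 2% would correspond to fewer than one disagreement per fifty scored tasks under this definition.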

Code & Models

Repositories

agentsea/osuniverse (official)

Taxonomy

Topics: Speech and dialogue systems