UI-CUBE: Enterprise-Grade Computer Use Agent Benchmarking Beyond Task Accuracy to Operational Reliability

Horia Cristescu; Charles Park; Trong Canh Nguyen; Sergiu Talmacel; Alexandru-Gabriel Ilie; Stefan Adam

arXiv:2511.17131 · cs.SE · November 24, 2025

TL;DR

UI-CUBE benchmarks enterprise-grade computer use agents across simple and complex tasks, exposing fundamental architectural limitations that hinder operational reliability in real-world enterprise automation scenarios.

Contribution

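Introduces UI-CUBE, a 226-task benchmark spanning two difficulty tiers, with systematic interface variation, multi-resolution testing, and automated validation of task success through the application state, designed to expose the architectural weaknesses of current Computer Use Agents (CUAs).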

Findings

01 Simple tasks achieve 67-85% success, close to human performance.

02 Complex workflows drop to 9-19% success, revealing significant limitations.

03 Performance gap indicates architectural issues in memory, planning, and state management.

Abstract

While current Computer Use Agent (CUA) benchmarks measure task completion effectively, they provide limited assessment of enterprise deployment readiness, emphasizing functional correctness over the operational reliability required for production systems. We present UI-CUBE (UiPath Computer Use BEnchmark), a systematic benchmark comprising 226 tasks across two difficulty tiers designed to expose fundamental architectural limitations in current CUAs. Our evaluation covers simple UI interactions (136 tasks) and complex workflows including copy-paste tasks (50 tasks) and enterprise application scenarios (40 tasks), with systematic interface variation coverage, multi-resolution testing and automated validation of task success through the application state. Evaluation of five state-of-the-art models reveals a sharp capability cliff rather than gradual performance degradation. Simple UI…
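The abstract describes automated validation of task success through the application state, reported per task category. A minimal sketch of that idea follows. Only the task counts (136, 50, 40) come from the abstract; the `Task` record, the `AppState` type, the `success_rate` helper, and the example state keys are hypothetical and are not taken from the UI-CUBE code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Task counts per category, taken from the abstract.
TASK_COUNTS = {"simple_ui": 136, "copy_paste": 50, "enterprise_app": 40}

# Hypothetical type: a snapshot of application state after the agent finishes.
AppState = Dict[str, object]


@dataclass
class Task:
    """Hypothetical task record; field names are illustrative only."""
    task_id: str
    category: str
    # Predicate over the final application state; True means the task succeeded.
    validate: Callable[[AppState], bool]


def success_rate(tasks: List[Task], final_states: Dict[str, AppState]) -> Dict[str, float]:
    """Return the fraction of tasks whose final state passes validation, per category."""
    passed: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for task in tasks:
        total[task.category] = total.get(task.category, 0) + 1
        # A task with no recorded final state counts as a failure.
        state = final_states.get(task.task_id)
        ok = state is not None and task.validate(state)
        passed[task.category] = passed.get(task.category, 0) + int(ok)
    return {cat: passed[cat] / total[cat] for cat in total}


# Illustrative tasks and states (made up for this sketch).
tasks = [
    Task("t1", "simple_ui", lambda s: s.get("checkbox_checked") is True),
    Task("t2", "simple_ui", lambda s: s.get("selected_date") == "2025-01-31"),
    Task("t3", "copy_paste", lambda s: s.get("rows_copied") == 3),
]
final_states = {
    "t1": {"checkbox_checked": True},
    "t2": {"selected_date": "2025-02-01"},
    # "t3" has no recorded state, so it is scored as a failure.
}
rates = success_rate(tasks, final_states)
```

Checking the final application state rather than the agent's action trace means any sequence of actions that reaches the correct state is scored as a success, which is one plausible reading of the validation approach the abstract mentions.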

Taxonomy

Topics: Software System Performance and Reliability · Personal Information Management and User Behavior · Advanced Software Engineering Methodologies