UI-CUBE: Enterprise-Grade Computer Use Agent Benchmarking Beyond Task Accuracy to Operational Reliability
Horia Cristescu, Charles Park, Trong Canh Nguyen, Sergiu Talmacel, Alexandru-Gabriel Ilie, Stefan Adam

TL;DR
UI-CUBE benchmarks enterprise-grade computer use agents across simple and complex tasks, exposing fundamental architectural limitations that hinder operational reliability in real-world enterprise automation scenarios.
Contribution
Introduces UI-CUBE, a 226-task benchmark spanning two difficulty tiers, with systematic evaluation methods to assess and diagnose the architectural weaknesses of current computer use agents (CUAs).
Findings
Simple UI interactions reach 67-85% success across the five evaluated models.
Complex workflows drop to 9-19% success, revealing significant limitations.
The gap between the two tiers points to architectural limitations in memory, planning, and state management.
Abstract
While current Computer Use Agent (CUA) benchmarks measure task completion effectively, they provide limited assessment of enterprise deployment readiness, emphasizing functional correctness over the operational reliability required for production systems. We present UI-CUBE (UiPath Computer Use BEnchmark), a systematic benchmark comprising 226 tasks across two difficulty tiers designed to expose fundamental architectural limitations in current CUAs. Our evaluation covers simple UI interactions (136 tasks) and complex workflows including copy-paste tasks (50 tasks) and enterprise application scenarios (40 tasks), with systematic interface variation coverage, multi-resolution testing and automated validation of task success through the application state. Evaluation of five state-of-the-art models reveals a sharp capability cliff rather than gradual performance degradation. Simple UI…
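The abstract mentions automated validation of task success through the application state. A minimal sketch of what such a check could look like, assuming a hypothetical task whose final state is exposed as a dictionary. The names `validate_state`, `success_rate`, `expected`, and `actual` are illustrative and not taken from UI-CUBE, and the sample data is invented:

```python
from typing import Any


def validate_state(expected: dict[str, Any], actual: dict[str, Any]) -> list[str]:
    """Return the mismatches between expected and actual application state.

    Hypothetical helper: an empty list means the task counts as a success.
    """
    mismatches = []
    for key, want in expected.items():
        if key not in actual:
            mismatches.append(f"{key}: missing")
        elif actual[key] != want:
            mismatches.append(f"{key}: expected {want!r}, got {actual[key]!r}")
    return mismatches


def success_rate(results: list[bool]) -> float:
    """Fraction of tasks that passed validation (0.0 for an empty run)."""
    return sum(results) / len(results) if results else 0.0


# Illustrative data only; not drawn from the benchmark.
expected = {"invoice_total": "120.00", "status": "submitted"}
actual = {"invoice_total": "120.00", "status": "draft"}
print(validate_state(expected, actual))
```

Checking the final state rather than the agent's action trace means a task passes only if the application ends up in the required condition, regardless of the path the agent took to get there.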
Taxonomy
Topics: Software System Performance and Reliability · Personal Information Management and User Behavior · Advanced Software Engineering Methodologies
