On the Reliability of Computer Use Agents

Gonzalo Gonzalez-Pumariega; Saaket Agashe; Jiachen Yang; Ang Li; Xin Eric Wang

arXiv:2604.17849 · cs.AI · April 21, 2026

TL;DR

This paper investigates why computer-use agents are unreliable by analyzing three factors: stochasticity during execution, ambiguity in task specification, and variability in agent behavior. It emphasizes repeated evaluation and stable strategies as ways to improve reliability.

Contribution

The paper identifies key sources of unreliability in computer-use agents and shows how task specification and variability in agent behavior affect consistent performance.

Findings

01 Reliability depends on task specification and agent behavior variability.

02 Repeated execution analysis reveals sources of unreliability.

03 Strategies that adapt to ambiguity and maintain stability improve reliability.

Abstract

Computer-use agents have rapidly improved on real-world tasks such as web navigation, desktop automation, and software interaction, in some cases surpassing human performance. Yet even when the task and model are unchanged, an agent that succeeds once may fail on a repeated execution of the same task. This raises a fundamental question: if an agent can succeed at a task once, what prevents it from doing so reliably? In this work, we study the sources of unreliability in computer-use agents through three factors: stochasticity during execution, ambiguity in task specification, and variability in agent behavior. We analyze these factors on OSWorld using repeated executions of the same task together with paired statistical tests that capture task-level changes across settings. Our analysis shows that reliability depends on both how tasks are specified and how agent behavior varies across…
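
The repeated-execution analysis described in the abstract can be illustrated with a minimal sketch. The abstract does not name the paired test used, so the exact McNemar test below is an assumption chosen for illustration; the function names (`success_rate`, `mcnemar_exact`) and the toy outcome data are hypothetical and are not taken from the paper or its repository.

```python
from math import comb


def success_rate(runs):
    """Fraction of successful runs for one task (runs is a list of bools)."""
    return sum(runs) / len(runs)


def mcnemar_exact(before, after):
    """Two-sided exact McNemar test on paired per-task binary outcomes.

    Assumption: this is one common paired test for task-level changes;
    the paper may use a different one.
    Returns (b, c, p_value), where b counts tasks that went from
    success to failure and c counts tasks that went from failure to success.
    """
    b = sum(1 for x, y in zip(before, after) if x and not y)
    c = sum(1 for x, y in zip(before, after) if not x and y)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence of change
    k = min(b, c)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical outcomes: three repeated runs per task under two settings.
runs_a = {"t1": [True, False, True], "t2": [True, True, True],
          "t3": [False, False, True], "t4": [False, False, False]}
runs_b = {"t1": [True, True, True], "t2": [True, True, True],
          "t3": [True, True, True], "t4": [False, True, False]}

tasks = sorted(runs_a)
# A task counts as "reliable" here only if every repeated run succeeds.
reliable_a = [all(runs_a[t]) for t in tasks]
reliable_b = [all(runs_b[t]) for t in tasks]

result = mcnemar_exact(reliable_a, reliable_b)
print(result)
```

Pairing outcomes by task, rather than comparing aggregate success rates, is what lets a test of this kind register task-level changes between settings even when the overall averages look similar.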

Code & Models

Repositories

simular-ai/cua_reliability (GitHub)
