TL;DR
This paper investigates why computer-use agents are unreliable across repeated executions of the same task, examining three factors: stochasticity during execution, ambiguity in task specification, and variability in agent behavior. It emphasizes repeated evaluation and stable strategies as ways to improve reliability.
Contribution
It identifies key sources of unreliability in agents and provides insights into how task specification and behavior variability affect consistent performance.
Findings
Reliability depends on task specification and agent behavior variability.
Analyzing repeated executions of the same task on OSWorld with paired statistical tests reveals the sources of unreliability.
Strategies that adapt to ambiguity and maintain stability improve reliability.
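As a rough illustration of what evaluating with repeated executions can look like, the sketch below summarizes per-task outcomes over several runs. The function name, the metric names, and the data layout (a dict from task ID to a list of pass/fail results) are assumptions made for this example, not the paper's definitions.

```python
def summarize_reliability(runs):
    """Summarize repeated-execution outcomes.

    runs: dict mapping a task ID to a non-empty list of booleans,
          one per repeated execution (hypothetical layout).
    """
    if not runs:
        raise ValueError("runs must contain at least one task")
    n_tasks = len(runs)
    # Per-task success rate across repeated executions.
    per_task = {task: sum(outcomes) / len(outcomes) for task, outcomes in runs.items()}
    return {
        # Average of the per-task success rates.
        "mean_success_rate": sum(per_task.values()) / n_tasks,
        # Fraction of tasks solved in at least one run.
        "solved_at_least_once": sum(1 for o in runs.values() if any(o)) / n_tasks,
        # Fraction of tasks solved in every run.
        "solved_every_time": sum(1 for o in runs.values() if all(o)) / n_tasks,
    }


example = {
    "task_a": [True, True, True],
    "task_b": [True, False, False],
    "task_c": [False, False, False],
}
summary = summarize_reliability(example)
print(summary)
```

The gap between "solved_at_least_once" and "solved_every_time" is one simple way to see how much a single-run score can overstate reliability.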
Abstract
Computer-use agents have rapidly improved on real-world tasks such as web navigation, desktop automation, and software interaction, in some cases surpassing human performance. Yet even when the task and model are unchanged, an agent that succeeds once may fail on a repeated execution of the same task. This raises a fundamental question: if an agent can succeed at a task once, what prevents it from doing so reliably? In this work, we study the sources of unreliability in computer-use agents through three factors: stochasticity during execution, ambiguity in task specification, and variability in agent behavior. We analyze these factors on OSWorld using repeated executions of the same task together with paired statistical tests that capture task-level changes across settings. Our analysis shows that reliability depends on both how tasks are specified and how agent behavior varies across…
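The abstract does not say which paired test is used. As a minimal sketch of a paired, task-level comparison between two settings, the example below uses an exact McNemar test, which is an assumption chosen here for illustration. It takes one binary outcome per task under each setting and tests whether tasks flip from success to failure more often in one direction than the other. The function and variable names are hypothetical.

```python
from math import comb


def mcnemar_exact(outcomes_a, outcomes_b):
    """Exact two-sided McNemar test on paired binary task outcomes.

    outcomes_a, outcomes_b: equal-length sequences of booleans, where
    index i is the same task evaluated under setting A and setting B.
    Returns (a_only, b_only, p_value).
    """
    if len(outcomes_a) != len(outcomes_b):
        raise ValueError("both settings must cover the same tasks")
    # Discordant pairs: tasks whose outcome differs between settings.
    a_only = sum(1 for x, y in zip(outcomes_a, outcomes_b) if x and not y)
    b_only = sum(1 for x, y in zip(outcomes_a, outcomes_b) if y and not x)
    n = a_only + b_only
    if n == 0:
        return a_only, b_only, 1.0  # no task changed outcome
    # Under the null, each discordant pair is a fair coin flip.
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2 * tail)


# Hypothetical outcomes for eight tasks under two task-specification settings.
setting_a = [True, True, True, True, True, True, False, True]
setting_b = [False, False, False, False, False, False, True, True]
a_only, b_only, p_value = mcnemar_exact(setting_a, setting_b)
print(a_only, b_only, p_value)
```

Only tasks whose outcome changes between settings contribute to the statistic, which is what makes the comparison task-level rather than a comparison of two aggregate success rates.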
