TL;DR
This paper introduces an event-driven, step-level framework for computer-use agents that allocates compute adaptively: small policies handle routine steps, and the agent escalates to a large model only when necessary, which improves efficiency.
Contribution
The paper proposes a modular, on-demand compute allocation method that uses risk monitors to improve efficiency without retraining existing agents.
Findings
Reduces unnecessary large model calls in GUI tasks.
Detects and recovers from progress stalls and semantic drift.
Maintains performance while lowering computational costs.
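The escalation idea summarized above can be illustrated with a minimal sketch. It is not the paper's implementation: `StallMonitor`, `run_episode`, the window size of 3, and the toy environment are all hypothetical names and values chosen for illustration. The sketch covers only a simple progress-stall check (the same observation and proposed action repeating) and leaves out semantic-drift detection, which would need its own monitor.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

Observation = str
Action = str
Policy = Callable[[Observation], Action]
EnvStep = Callable[[Observation, Action], Tuple[Observation, bool]]


class StallMonitor:
    """Hypothetical risk monitor: flags a stall when the last `window`
    (observation, proposed action) pairs are all identical."""

    def __init__(self, window: int = 3) -> None:
        self.window = window
        self.recent: Deque[Tuple[Observation, Action]] = deque(maxlen=window)

    def update(self, obs: Observation, action: Action) -> bool:
        self.recent.append((obs, action))
        return len(self.recent) == self.window and len(set(self.recent)) == 1

    def reset(self) -> None:
        self.recent.clear()


def run_episode(
    env_step: EnvStep,
    small_policy: Policy,
    large_policy: Policy,
    first_obs: Observation,
    max_steps: int = 20,
) -> List[Tuple[str, Action]]:
    """Use the small policy by default; call the large one only on a stall event."""
    monitor = StallMonitor(window=3)
    obs = first_obs
    log: List[Tuple[str, Action]] = []
    for _ in range(max_steps):
        action, source = small_policy(obs), "small"
        if monitor.update(obs, action):
            # Event fired: replace the small policy's proposal for this step only.
            action, source = large_policy(obs), "large"
            monitor.reset()
        log.append((source, action))
        obs, done = env_step(obs, action)
        if done:
            break
    return log


# Toy environment (an assumption for illustration): "click_ok" opens a dialog
# that only "click_cancel" can dismiss, so the small policy gets stuck there.
def toy_env_step(obs: Observation, action: Action) -> Tuple[Observation, bool]:
    if obs == "home" and action == "click_ok":
        return "dialog", False
    if obs == "dialog" and action == "click_cancel":
        return "done", True
    return obs, False  # ineffective action: the screen does not change


def toy_small_policy(obs: Observation) -> Action:
    return "click_ok"


def toy_large_policy(obs: Observation) -> Action:
    return "click_cancel" if obs == "dialog" else "click_ok"


episode_log = run_episode(toy_env_step, toy_small_policy, toy_large_policy, "home")
large_calls = sum(1 for source, _ in episode_log if source == "large")
```

In this toy run the large policy is consulted only at the step where the monitor fires, and every other step uses the small policy. The monitor wraps the existing policies from outside, which mirrors the "without retraining" point in the Contribution section.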
Abstract
Computer-use agents provide a promising path toward general software automation because they can interact directly with arbitrary graphical user interfaces instead of relying on brittle, application-specific integrations. Despite recent advances in benchmark performance, strong computer-use agents remain expensive and slow in practice, since most systems invoke large multimodal models at nearly every interaction step. We argue that this uniform allocation of compute is fundamentally inefficient for long-horizon GUI tasks. Such trajectories are highly heterogeneous: many steps are routine and can be handled reliably by smaller, cheaper policies, while errors tend to concentrate at a relatively small number of high-risk moments. Across computer-use benchmarks, these failures repeatedly take two forms: progress stalls, where the agent loops, repeats ineffective actions, or fails to make…
