"Are We Done Yet?": A Vision-Based Judge for Autonomous Task Completion of Computer Use Agents

Marta Sumyk; Oleksandr Kosovan

arXiv:2511.20067·cs.AI·November 26, 2025

TL;DR

This paper introduces a vision-based evaluation framework for autonomous computer use agents. The framework judges task completion from screenshots and task descriptions, and feeding its verdicts back to the agent yields an average relative improvement of 27 percent in overall task success.

Contribution

It presents a vision-language-model-based method for autonomously evaluating task completion, supported by a dataset of 1,260 human-labeled tasks across 42 built-in macOS applications, and uses the evaluator's feedback to make agents more reliable.

Findings

01

Achieves up to 73% accuracy in task success detection.

02

Improves overall task success by 27% on average (relative improvement) when evaluator feedback is applied.

03

Evaluated on a dataset of 1,260 human-labeled tasks spanning 42 built-in macOS applications.

Abstract

Computer Use Agents (CUAs) are designed to autonomously operate digital interfaces, yet they often fail to reliably determine whether a given task has been completed. We present an autonomous evaluation and feedback framework that uses vision-language models to assess task completion directly from screenshots and task descriptions. Our dataset covers 42 built-in macOS applications and 1,260 human-labeled tasks across a wide range of scenarios. Our framework achieves up to 73 percent accuracy in task success detection and yields an average relative improvement of 27 percent in overall task success when evaluator feedback is applied. These results show that vision-based evaluation can serve as an effective feedback mechanism that improves the reliability and self-correction of autonomous computer-use agents.
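The evaluate-and-retry loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Verdict` type, the `Judge` and `Agent` signatures, the retry limit, and the toy stand-ins are all hypothetical, chosen only to show how a screenshot-based verdict might be fed back to an agent. The helper functions show how judge accuracy and a relative improvement are computed. The numbers used with them in the usage note are made up and are not results from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    done: bool      # the judge's belief that the task is complete
    feedback: str   # short explanation the agent can act on


# Hypothetical interfaces (assumptions, not taken from the paper):
# a judge maps (task description, screenshot bytes) to a Verdict;
# an agent maps (task description, optional feedback) to a new screenshot.
Judge = Callable[[str, bytes], Verdict]
Agent = Callable[[str, Optional[str]], bytes]


def run_with_feedback(task: str, agent: Agent, judge: Judge,
                      max_attempts: int = 3) -> Tuple[bool, int]:
    """Let the agent act, ask the judge, and retry with the judge's feedback."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        screenshot = agent(task, feedback)
        verdict = judge(task, screenshot)
        if verdict.done:
            return True, attempt
        feedback = verdict.feedback  # passed to the agent on the next attempt
    return False, max_attempts


def judge_accuracy(predictions: List[bool], labels: List[bool]) -> float:
    """Fraction of judge verdicts that agree with the human labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def relative_improvement(baseline: float, with_feedback: float) -> float:
    """Relative change in success rate: (new - old) / old."""
    return (with_feedback - baseline) / baseline


# Toy stand-ins so the sketch runs without a real agent or vision-language model.
def toy_agent(task: str, feedback: Optional[str]) -> bytes:
    # Pretends to succeed only after it has received feedback once.
    return b"note-created" if feedback is not None else b"empty-window"


def toy_judge(task: str, screenshot: bytes) -> Verdict:
    done = screenshot == b"note-created"
    return Verdict(done, "" if done else "The note is not visible in the window; try again.")
```

With these stand-ins, `run_with_feedback("Create a note", toy_agent, toy_judge)` succeeds on the second attempt, after the first verdict has been fed back. As an arithmetic illustration with invented numbers, `relative_improvement(0.40, 0.50)` is 0.25: a 10-point absolute gain from a 40 percent baseline is a 25 percent relative gain.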

Taxonomy

Topics: Robot Manipulation and Learning · Software Engineering Research · Software Testing and Debugging Techniques