"Are We Done Yet?": A Vision-Based Judge for Autonomous Task Completion of Computer Use Agents
Marta Sumyk, Oleksandr Kosovan

TL;DR
This paper introduces a vision-based evaluation framework for computer use agents that judges task completion from screenshots and task descriptions. Feeding the evaluator's feedback back to the agent yields an average relative improvement of 27% in overall task success.
Contribution
A vision-language-model-based method for autonomously evaluating task completion, supported by a dataset of 1,260 human-labeled tasks across 42 built-in macOS applications and aimed at making agents more reliable.
Findings
Achieves up to 73% accuracy in task success detection.
Yields an average relative improvement of 27% in overall task success when evaluator feedback is applied.
The dataset covers 42 built-in macOS applications and 1,260 human-labeled tasks.
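The two headline numbers measure different things: accuracy compares the judge's verdicts against human labels, while the 27% figure is a relative (not absolute) gain in agent success rate. The sketch below shows one common way to compute each. The function names and all sample values are hypothetical illustrations, not figures or code from the paper.

```python
from typing import Sequence


def judge_accuracy(predictions: Sequence[bool], labels: Sequence[bool]) -> float:
    """Fraction of tasks where the judge's verdict matches the human label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one labeled task is required")
    matches = sum(p == y for p, y in zip(predictions, labels))
    return matches / len(labels)


def relative_improvement(baseline: float, with_feedback: float) -> float:
    """Relative gain in success rate: (new - old) / old."""
    if baseline <= 0:
        raise ValueError("baseline success rate must be positive")
    return (with_feedback - baseline) / baseline


# Hypothetical values, chosen only to illustrate the arithmetic.
example_accuracy = judge_accuracy([True, False, True, True],
                                  [True, True, True, False])
example_gain = relative_improvement(baseline=0.40, with_feedback=0.508)
```

With these made-up inputs, a success rate rising from 40% to 50.8% is a relative improvement of about 27%, even though the absolute gain is under 11 percentage points.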
Abstract
Computer Use Agents (CUAs) are designed to autonomously operate digital interfaces, yet they often fail to reliably determine whether a given task has been completed. We present an autonomous evaluation and feedback framework that uses vision-language models to assess task completion directly from screenshots and task descriptions. Our dataset covers 42 built-in macOS applications and 1,260 human-labeled tasks across a wide range of scenarios. Our framework achieves up to 73 percent accuracy in task success detection and yields an average relative improvement of 27 percent in overall task success when evaluator feedback is applied. These results show that vision-based evaluation can serve as an effective feedback mechanism that improves the reliability and self-correction of autonomous computer-use agents.
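The evaluate-then-retry idea in the abstract can be sketched as a small loop: the agent attempts the task, a judge inspects the final screenshot together with the task description, and on failure the judge's feedback is passed to the next attempt. This is a minimal sketch under stated assumptions, not the authors' implementation: `Verdict`, `run_with_feedback`, the callable signatures, and the retry limit are all hypothetical, and the stub agent and judge stand in for a real CUA and a real vision-language model.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Verdict:
    success: bool   # judge's decision: was the task completed?
    feedback: str   # natural-language hint returned to the agent


# Hypothetical interfaces (assumptions, not from the paper):
# the agent maps (task, previous feedback) to a final screenshot as bytes,
# the judge maps (task, screenshot) to a Verdict.
Agent = Callable[[str, Optional[str]], bytes]
Judge = Callable[[str, bytes], Verdict]


def run_with_feedback(task: str, agent: Agent, judge: Judge,
                      max_attempts: int = 3) -> Tuple[Verdict, int]:
    """Run the agent, judge the result, and retry with feedback on failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    feedback: Optional[str] = None
    verdict = Verdict(success=False, feedback="")
    for attempt in range(1, max_attempts + 1):
        screenshot = agent(task, feedback)
        verdict = judge(task, screenshot)
        if verdict.success:
            return verdict, attempt
        feedback = verdict.feedback  # carry the hint into the next attempt
    return verdict, max_attempts


def stub_agent(task: str, feedback: Optional[str]) -> bytes:
    """Stand-in agent: only reaches the goal state once it receives feedback."""
    return b"done" if feedback else b"draft"


def stub_judge(task: str, screenshot: bytes) -> Verdict:
    """Stand-in judge: a real one would query a vision-language model."""
    if screenshot == b"done":
        return Verdict(True, "Task appears complete.")
    return Verdict(False, "The expected final state is not visible on screen.")
```

Calling `run_with_feedback("Create a note titled 'Groceries'", stub_agent, stub_judge)` exercises the loop end to end. In a real system the judge callable would wrap a vision-language model call and the agent would drive the macOS interface.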
Taxonomy
Topics: Robot Manipulation and Learning · Software Engineering Research · Software Testing and Debugging Techniques
