CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents
Marta Sumyk, Oleksandr Kosovan

TL;DR
This paper evaluates vision-language models as autonomous auditors for computer-use agents across multiple desktop environments, revealing their strengths and limitations in assessing task success and highlighting the need for improved reliability measures.
Contribution
It introduces a large-scale meta-evaluation of VLMs as auditors for CUAs, analyzing their accuracy, calibration, and agreement across diverse environments and benchmarks.
Findings
VLM auditors achieve high accuracy and calibration in simple environments.
Performance degrades significantly in complex or heterogeneous settings.
High disagreement among models indicates fundamental limitations of current auditing approaches.
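The three quantities named above (accuracy, calibration, and inter-auditor agreement) can be illustrated with a minimal sketch. It assumes each auditor returns a binary success verdict plus a confidence in that verdict. The choice of expected calibration error and Cohen's kappa, the function names, and the toy data are illustrative assumptions, not the paper's confirmed metrics or results.

```python
# Hypothetical sketch: metric choices and toy data are assumptions, not taken from the paper.

def accuracy(preds, labels):
    """Fraction of auditor verdicts that match the ground-truth task outcome."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def expected_calibration_error(confs, preds, labels, n_bins=5):
    """Weighted mean gap between stated confidence and observed accuracy per bin."""
    n = len(labels)
    bins = [[] for _ in range(n_bins)]
    for c, p, y in zip(confs, preds, labels):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append((c, p == y))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            avg_acc = sum(ok for _, ok in b) / len(b)
            ece += len(b) / n * abs(avg_conf - avg_acc)
    return ece

def cohen_kappa(a, b):
    """Chance-corrected agreement between two auditors' binary verdicts."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Toy data: 1 = task judged successful, 0 = judged failed.
labels    = [1, 0, 1, 1, 0, 1, 0, 0]
auditor_a = [1, 0, 1, 0, 0, 1, 1, 0]
auditor_b = [1, 1, 1, 1, 0, 0, 0, 0]
confs_a   = [0.9] * len(labels)  # auditor A reports 0.9 confidence everywhere

print("accuracy A:", accuracy(auditor_a, labels))
print("accuracy B:", accuracy(auditor_b, labels))
print("ECE A:", expected_calibration_error(confs_a, auditor_a, labels))
print("kappa A vs B:", cohen_kappa(auditor_a, auditor_b))
```

In this toy setup both auditors reach the same accuracy while agreeing with each other no more often than chance would predict, which is why agreement is worth reporting separately from accuracy.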
Abstract
Computer-Use Agents (CUAs) are emerging as a new paradigm in human-computer interaction, enabling autonomous execution of tasks in desktop environments by interpreting high-level natural-language instructions. As such agents become increasingly capable and are deployed across diverse desktop environments, evaluating their behavior in a scalable and reliable manner becomes a critical challenge. Existing evaluation pipelines rely on static benchmarks, rule-based success checks, or manual inspection, which are brittle, costly, and poorly aligned with real-world usage. In this work, we study Vision-Language Models (VLMs) as autonomous auditors for assessing CUA task completion directly from observable interactions, and we conduct a large-scale meta-evaluation of five VLMs that judge task success given a natural-language instruction and the final environment state. Our evaluation spans three widely…
Taxonomy
Topics: Personal Information Management and User Behavior · Multimodal Machine Learning Applications · Usability and User Interface Design
