CUARewardBench: A Benchmark for Evaluating Reward Models on Computer-using Agent
Haojia Lin, Xiaoyu Tan, Yulei Qin, Zihan Xu, Yuchen Shi, Zongyi Li, Gang Li, Shaofei Cai, Siqi Cai, Chaoyou Fu, Ke Li, Xing Sun

TL;DR
CUARewardBench is a comprehensive benchmark for evaluating reward models on computer-using agents. It reveals the limitations of current reward models and proposes an ensemble method that significantly improves evaluation reliability.
Contribution
This paper introduces the first benchmark for CUA reward models, a diverse dataset, and a novel ensemble method to enhance reward evaluation accuracy.
Findings
Current reward models have limited visual reasoning capabilities.
General vision-language models outperform specialized CUA models in reward evaluation.
Unanimous prompt ensemble significantly improves reward model reliability.
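This listing does not spell out the ensemble rule, so the following is only a minimal sketch. It assumes that "unanimous" means a verdict is accepted only when every prompt variant of the reward model gives the same judgment, and that the ensemble abstains otherwise. The function name `unanimous_vote` and the boolean verdict format are hypothetical, not taken from the paper.

```python
from typing import Optional, Sequence


def unanimous_vote(judgments: Sequence[bool]) -> Optional[bool]:
    """Return the shared verdict if all judgments agree, else None (abstain).

    Assumption: each entry is one prompt variant's success/failure verdict
    on the same trajectory. This is an illustrative rule, not the paper's code.
    """
    if not judgments:
        return None  # nothing to vote on
    first = judgments[0]
    return first if all(j == first for j in judgments) else None


# Hypothetical verdicts from three prompt variants on one trajectory.
agreeing = unanimous_vote([True, True, True])
split = unanimous_vote([True, False, True])
print(agreeing, split)
```

Under this assumed rule the ensemble trades coverage for precision: it issues fewer verdicts, but each one is backed by every prompt variant.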
Abstract
Computer-using agents (CUAs) enable task completion through natural interaction with operating systems and software interfaces. While script-based verifiers are widely adopted for evaluation, they suffer from limited scalability and inability to provide step-wise assessment. Reward models offer promising alternatives, but their effectiveness on CUA evaluation remains largely underexplored. To address this gap, we present CUARewardBench, comprising four key contributions: (1) First-ever Comprehensive CUA Reward Benchmark: We introduce the first benchmark for evaluating both outcome reward models (ORM) and process reward models (PRM) on CUA tasks, enabling systematic assessment across trajectory-level and step-level evaluation. (2) Diverse, Practical and Reliable Dataset: CUARewardBench encompasses trajectories from 10 software categories and 7 agent architectures with varying performance…
Taxonomy
Topics: Advanced Software Engineering Methodologies · Explainable Artificial Intelligence (XAI) · AI-based Problem Solving and Planning
