CUARewardBench: A Benchmark for Evaluating Reward Models on Computer-using Agent

Haojia Lin; Xiaoyu Tan; Yulei Qin; Zihan Xu; Yuchen Shi; Zongyi Li; Gang Li; Shaofei Cai; Siqi Cai; Chaoyou Fu; Ke Li; Xing Sun

arXiv:2510.18596 · cs.SE · October 22, 2025

TL;DR

CUARewardBench is a comprehensive benchmark for evaluating reward models on computer-using agents, revealing current limitations and proposing an ensemble method that significantly improves evaluation reliability.

Contribution

This paper introduces the first benchmark for CUA reward models, a diverse dataset, and a novel ensemble method to enhance reward evaluation accuracy.

Findings

01. Current reward models have limited visual reasoning capabilities.

02. General vision-language models outperform specialized CUA models in reward evaluation.

03. Unanimous prompt ensemble significantly improves reward model reliability.
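
The third finding names a unanimous prompt ensemble, but this page does not spell out how it works. The sketch below is one plausible reading, not the authors' implementation: several judges (for example, the same model queried with different prompt templates) each return a success/failure verdict, and the ensemble commits to a verdict only when all of them agree. The names `Judge`, `unanimous_vote`, and `unanimous_prompt_ensemble` are hypothetical, and abstaining on disagreement is an assumption.

```python
from typing import Callable, Optional, Sequence

# Hypothetical judge type: maps a trajectory description to a
# success (True) / failure (False) verdict. In practice each judge
# could be one model call with a distinct prompt template.
Judge = Callable[[str], bool]


def unanimous_vote(votes: Sequence[bool]) -> Optional[bool]:
    """Return the shared verdict if every vote agrees, else None (abstain)."""
    if not votes:
        return None  # no judges, no verdict
    first = votes[0]
    return first if all(v == first for v in votes) else None


def unanimous_prompt_ensemble(trajectory: str, judges: Sequence[Judge]) -> Optional[bool]:
    """Query every judge on the trajectory and apply unanimous voting."""
    return unanimous_vote([judge(trajectory) for judge in judges])


if __name__ == "__main__":
    # Toy stand-ins for prompt-conditioned reward models (assumptions).
    lenient = lambda t: "saved" in t
    strict = lambda t: "saved" in t and "error" not in t

    print(unanimous_prompt_ensemble("file saved", [lenient, strict]))
    print(unanimous_prompt_ensemble("file saved with error", [lenient, strict]))
```

Under this reading the ensemble trades coverage for precision: it gives no verdict on trajectories where the judges disagree, so the verdicts it does give are backed by every prompt.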

Abstract

Computer-using agents (CUAs) enable task completion through natural interaction with operating systems and software interfaces. While script-based verifiers are widely adopted for evaluation, they suffer from limited scalability and inability to provide step-wise assessment. Reward models offer promising alternatives, but their effectiveness on CUA evaluation remains largely underexplored. To address this gap, we present CUARewardBench, comprising four key contributions: (1) First-ever Comprehensive CUA Reward Benchmark: We introduce the first benchmark for evaluating both outcome reward models (ORM) and process reward models (PRM) on CUA tasks, enabling systematic assessment across trajectory-level and step-level evaluation. (2) Diverse, Practical and Reliable Dataset: CUARewardBench encompasses trajectories from 10 software categories and 7 agent architectures with varying performance…

Taxonomy

Topics: Advanced Software Engineering Methodologies · Explainable Artificial Intelligence (XAI) · AI-based Problem Solving and Planning