Video-Based Reward Modeling for Computer-Use Agents

Linxin Song; Jieyu Zhang; Huanxin Sheng; Taiwei Shi; Gupta Rahul; Yang Liu; Ranjay Krishna; Jian Kang; Jieyu Zhao

arXiv:2603.10178·cs.CV·March 12, 2026

TL;DR

This paper introduces a scalable, video-based reward model for evaluating computer-use agents. It combines a new dataset (ExeVR-53k), adversarially synthesized negative samples, and spatiotemporal token pruning to assess task success from execution videos.

Contribution

The paper presents ExeVR-53k, a dataset of 53k execution videos paired with task instructions and rewards, and ExeVRM, a reward model that predicts task success from video and outperforms proprietary models such as GPT-5.2 and Gemini-3 Pro.

Findings

01. ExeVR-53k dataset with 53k video-task-reward triplets

02. ExeVRM achieves 84.7% accuracy in success prediction

03. ExeVRM outperforms GPT-5.2 and Gemini-3 Pro in evaluations

Abstract

Computer-using agents (CUAs) are becoming increasingly capable; however, it remains difficult to scale evaluation of whether a trajectory truly fulfills a user instruction. In this work, we study reward modeling from execution video: a sequence of keyframes from an agent trajectory that is independent of the agent's internal reasoning or actions. Although video-execution modeling is method-agnostic, it presents key challenges, including highly redundant layouts and subtle, localized cues that determine success. We introduce Execution Video Reward 53k (ExeVR-53k), a dataset of 53k high-quality video-task-reward triplets. We further propose adversarial instruction translation to synthesize negative samples with step-level annotations. To enable learning from long, high-resolution execution videos, we design spatiotemporal token pruning, which removes homogeneous regions and persistent…
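
The listing truncates the description of spatiotemporal token pruning, so the following is only a minimal sketch of the general idea and not the authors' implementation. It assumes grayscale keyframes given as nested lists, treats a "homogeneous region" as a patch with low pixel variance, and assumes that "persistent" content means a patch that is nearly unchanged from the previous keyframe. The function names (`split_patches`, `prune_tokens`) and both thresholds are hypothetical.

```python
from statistics import pvariance


def split_patches(frame, patch):
    """Split a 2D grayscale frame into {(patch_row, patch_col): flat pixel list}."""
    height, width = len(frame), len(frame[0])
    patches = {}
    for r in range(0, height - patch + 1, patch):
        for c in range(0, width - patch + 1, patch):
            patches[(r // patch, c // patch)] = [
                frame[y][x]
                for y in range(r, r + patch)
                for x in range(c, c + patch)
            ]
    return patches


def prune_tokens(frames, patch=2, var_thresh=1.0, diff_thresh=1.0):
    """Return, for each frame, the sorted patch coordinates that survive pruning.

    Assumed criteria (illustrative only):
      - spatial: drop patches whose pixel variance is below var_thresh
      - temporal: drop patches whose mean absolute difference from the
        same patch in the previous frame is below diff_thresh
    """
    kept_per_frame = []
    previous = None
    for frame in frames:
        current = split_patches(frame, patch)
        kept = []
        for pos, pixels in current.items():
            # Spatial pruning: near-uniform patch carries little information.
            if pvariance(pixels) < var_thresh:
                continue
            # Temporal pruning: patch is essentially a repeat of the last frame.
            if previous is not None:
                mean_abs_diff = sum(
                    abs(a - b) for a, b in zip(pixels, previous[pos])
                ) / len(pixels)
                if mean_abs_diff < diff_thresh:
                    continue
            kept.append(pos)
        kept_per_frame.append(sorted(kept))
        previous = current
    return kept_per_frame


# Two 4x4 keyframes: a textured patch stays fixed in the top-left corner,
# and a new textured patch appears in the bottom-right in the second frame.
frame0 = [[0, 10, 5, 5], [10, 0, 5, 5], [5, 5, 5, 5], [5, 5, 5, 5]]
frame1 = [[0, 10, 5, 5], [10, 0, 5, 5], [5, 5, 0, 9], [5, 5, 9, 0]]
kept = prune_tokens([frame0, frame1])
```

In this toy input, the first frame keeps only its textured top-left patch, and the second frame keeps only the newly changed bottom-right patch, since the top-left patch repeats unchanged. A real model would apply such a mask to vision-encoder tokens rather than raw pixels; that mapping is not specified in the listing.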

Code & Models

Models

lime-nlp/ExeVRM-8B (model, 82 downloads, 3 likes)

Datasets

lime-nlp/ExeVR-53k (dataset, 100 downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Human Pose and Action Recognition