PRM-as-a-Judge: A Dense Evaluation Paradigm for Fine-Grained Robotic Auditing

Yuheng Ji; Yuyang Liu; Huajie Tan; Xuchuan Huang; Fanding Huang; Yijie Xu; Cheng Chi; Yuting Zhao; Huaihai Lyu; Peterson Co; Mingyu Cao; Qiongyu Zhang; Zhe Li; Enshen Zhou; Pengwei Wang; Zhongyuan Wang; Shanghang Zhang; and Xiaolong Zheng

arXiv:2603.21669 · cs.RO · March 24, 2026

TL;DR

This paper introduces PRM-as-a-Judge, a dense evaluation framework using Process Reward Models to assess robotic policy execution from videos, capturing detailed progress and diagnosing subtle behavioral issues beyond success/failure metrics.

Contribution

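The paper proposes a dense evaluation paradigm built around the OPD (Outcome-Process-Diagnosis) metric system, formalizes two properties that dense robotic evaluation should satisfy (macro-consistency and micro-resolution), and empirically evaluates PRM judges on fine-grained progress discrimination in robotic auditing.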

Findings

01 PRM judges outperform existing methods in micro-scale progress discrimination

02 The OPD metric system effectively formalizes execution quality

03 Structured audits reveal behavioral signatures and failure modes

Abstract

Current robotic evaluation is still largely dominated by binary success rates, which collapse rich execution processes into a single outcome and obscure critical qualities such as progress, efficiency, and stability. To address this limitation, we propose PRM-as-a-Judge, a dense evaluation paradigm that leverages Process Reward Models (PRMs) to audit policy execution directly from trajectory videos by estimating task progress from observation sequences. Central to this paradigm is the OPD (Outcome-Process-Diagnosis) metric system, which explicitly formalizes execution quality via a task-aligned progress potential. We characterize dense robotic evaluation through two axiomatic properties: macro-consistency, which requires additive and path-consistent aggregation, and micro-resolution, which requires sensitivity to fine-grained physical evolution. Under this formulation, potential-based…
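The macro-consistency property described above (additive, path-consistent aggregation) can be illustrated with a small sketch. It assumes a judge has already assigned a scalar progress potential to each frame of a trajectory; the values in `phi` and the helper names `step_deltas` and `aggregate` are hypothetical and are not taken from the paper. The point is only that when per-step progress is defined as a difference of potentials, the step scores telescope, so the total depends on the endpoints alone and splitting a trajectory into segments does not change the sum.

```python
from typing import List, Sequence


def step_deltas(potentials: Sequence[float]) -> List[float]:
    """Per-step progress as the difference of consecutive potentials."""
    return [b - a for a, b in zip(potentials, potentials[1:])]


def aggregate(potentials: Sequence[float]) -> float:
    """Total progress over a trajectory: the sum of its per-step deltas."""
    return sum(step_deltas(potentials))


# Hypothetical per-frame progress potentials in [0, 1] (illustrative values,
# not real judge output). The dip at index 2 stands for a temporary regression.
phi = [0.0, 0.25, 0.125, 0.5, 1.0]

total = aggregate(phi)

# Additivity: scoring two segments that share a boundary frame and adding
# the results gives the same total as scoring the whole trajectory.
first_half = aggregate(phi[:3])   # frames 0..2
second_half = aggregate(phi[2:])  # frames 2..4

print(total, first_half + second_half, phi[-1] - phi[0])
```

In this sketch the regression at index 2 shows up as a negative step delta but cancels in the total, which is why an aggregate score alone would not reveal it and per-step values are needed for diagnosis.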

Code & Models

Datasets

yuheng2000/RoboPulse
dataset · 20 downloads

Taxonomy

Topics: Human-Automation Interaction and Safety · Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning