Trustworthy Evaluation of Robotic Manipulation: A New Benchmark and AutoEval Methods
Mengyuan Liu, Juyi Sheng, Peiming Li, Ziyi Wang, Tianming Xu, Tiantian Xu, Hong Liu

TL;DR
This paper introduces a benchmark (Eval-Actions) and an automatic evaluation architecture (AutoEval) for trustworthy assessment of robotic manipulation. It targets two trust dimensions, source authenticity and execution quality, and distinguishes policy-generated from teleoperated videos with high accuracy.
Contribution
It presents the Eval-Actions benchmark and the AutoEval architecture, which integrate diverse data and supervision signals for comprehensive, trustworthy evaluation of robotic policies.
Findings
AutoEval achieves Spearman rank correlation coefficients (SRCC) of 0.81 and 0.84 under different protocols.
The framework can distinguish policy-generated from teleoperated videos with 99.6% accuracy.
The dataset includes failure scenarios and multiple supervision signals for robust evaluation.
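For readers unfamiliar with SRCC, the following is a minimal sketch of how a Spearman rank correlation can be computed, assuming it is measured between scores predicted by an evaluator and reference quality scores. The function names and the score values are hypothetical and chosen only for illustration; they are not taken from the paper or its dataset.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def srcc(predicted, reference):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(predicted), average_ranks(reference))


# Hypothetical scores for five trajectories (illustrative values only).
reference_scores = [4.5, 2.0, 3.5, 1.0, 5.0]
predicted_scores = [0.80, 0.40, 0.35, 0.10, 0.95]

print(round(srcc(predicted_scores, reference_scores), 3))  # 0.9
```

Because only the ordering matters, the predicted and reference scores may live on different scales. In this toy example a single swapped pair among five items lowers the coefficient from 1.0 to 0.9.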
Abstract
Driven by the rapid evolution of Vision-Action and Vision-Language-Action models, imitation learning has significantly advanced robotic manipulation capabilities. However, evaluation methodologies have lagged behind, hindering the establishment of Trustworthy Evaluation for these behaviors. Current paradigms rely on binary success rates, failing to address the critical dimensions of trust: Source Authenticity (i.e., distinguishing genuine policy behaviors from human teleoperation) and Execution Quality (e.g., smoothness and safety). To bridge these gaps, we propose a solution that combines the Eval-Actions benchmark and the AutoEval architecture. First, we construct the Eval-Actions benchmark to support trustworthiness analysis. Distinct from existing datasets restricted to successful human demonstrations, Eval-Actions integrates VA and VLA policy execution trajectories alongside human…
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Adversarial Robustness in Machine Learning
