Trustworthy Evaluation of Robotic Manipulation: A New Benchmark and AutoEval Methods

Mengyuan Liu; Juyi Sheng; Peiming Li; Ziyi Wang; Tianming Xu; Tiantian Xu; Hong Liu

arXiv:2601.18723 · cs.RO · January 27, 2026

Open Access

TL;DR

This paper introduces the Eval-Actions benchmark and the AutoEval automatic evaluation methods for trustworthy assessment of robotic manipulation. They target two trust dimensions, source authenticity and execution quality, and distinguish policy-generated from teleoperated videos with 99.6% accuracy.

Contribution

The paper presents the Eval-Actions benchmark and the AutoEval architecture, which integrate diverse data and supervision signals for comprehensive, trustworthy evaluation of robotic policies.

Findings

01. AutoEval achieves a Spearman's rank correlation coefficient (SRCC) of 0.81 and 0.84 under different evaluation protocols.

02. The framework distinguishes policy-generated videos from teleoperated ones with 99.6% accuracy.

03. The dataset includes failure scenarios and multiple supervision signals for robust evaluation.
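
The two headline metrics in the findings, SRCC and classification accuracy, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names (`average_ranks`, `srcc`, `accuracy`), the label strings, and the toy scores are all hypothetical and chosen only for illustration. SRCC is computed here in the standard way, as the Pearson correlation of the two rank vectors, with tied values sharing an averaged rank.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j get one shared rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def srcc(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rp, rr = average_ranks(predicted), average_ranks(reference)
    n = len(rp)
    mean_p, mean_r = sum(rp) / n, sum(rr) / n
    cov = sum((a - mean_p) * (b - mean_r) for a, b in zip(rp, rr))
    var_p = sum((a - mean_p) ** 2 for a in rp)
    var_r = sum((b - mean_r) ** 2 for b in rr)
    if var_p == 0 or var_r == 0:
        raise ValueError("SRCC is undefined when one input is constant")
    return cov / (var_p * var_r) ** 0.5


def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of items whose predicted label matches the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two equal-length, non-empty label sequences")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


if __name__ == "__main__":
    # Hypothetical data: model quality scores vs. human quality grades.
    model_scores = [0.9, 0.4, 0.7, 0.2, 0.6]
    human_grades = [5, 2, 3, 1, 4]
    print(round(srcc(model_scores, human_grades), 2))  # prints 0.9

    # Hypothetical source labels for four videos.
    guessed = ["policy", "teleop", "policy", "policy"]
    truth = ["policy", "teleop", "teleop", "policy"]
    print(accuracy(guessed, truth))  # prints 0.75
```

Because SRCC depends only on rank order, it rewards an evaluator that orders trajectories the same way the reference grades do, even if the raw score scales differ.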

Abstract

Driven by the rapid evolution of Vision-Action and Vision-Language-Action models, imitation learning has significantly advanced robotic manipulation capabilities. However, evaluation methodologies have lagged behind, hindering the establishment of Trustworthy Evaluation for these behaviors. Current paradigms rely on binary success rates, failing to address the critical dimensions of trust: Source Authenticity (i.e., distinguishing genuine policy behaviors from human teleoperation) and Execution Quality (e.g., smoothness and safety). To bridge these gaps, we propose a solution that combines the Eval-Actions benchmark and the AutoEval architecture. First, we construct the Eval-Actions benchmark to support trustworthiness analysis. Distinct from existing datasets restricted to successful human demonstrations, Eval-Actions integrates VA and VLA policy execution trajectories alongside human…

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Adversarial Robustness in Machine Learning