Experiences from Benchmarking Vision-Language-Action Models for Robotic Manipulation

Yihao Zhang; Yuankai Qi; Xi Zheng

arXiv:2511.11298·cs.RO·November 17, 2025

Open Access

TL;DR

This paper empirically benchmarks four vision-language-action models for robotic manipulation, evaluating their performance, adaptability, and failure modes across simulation and real-world platforms to inform deployment trade-offs.

Contribution

It introduces a standardized evaluation framework and provides comparative insights into model performance, adaptability, and computational demands in real-world robotic manipulation.

Findings

01

$\pi_0$ shows superior out-of-distribution adaptability

02

ACT offers high in-distribution stability

03

Identifies common failure modes like near-miss grasps

Abstract

Foundation models applied in robotics, particularly Vision-Language-Action (VLA) models, hold great promise for achieving general-purpose manipulation. Yet systematic real-world evaluations and cross-model comparisons remain scarce. This paper reports our empirical experiences from benchmarking four representative VLAs (ACT, OpenVLA-OFT, RDT-1B, and $\pi_0$) across four manipulation tasks conducted both in simulation and on the ALOHA Mobile platform. We establish a standardized evaluation framework that measures performance along three key dimensions: (1) accuracy and efficiency (success rate and time-to-success), (2) adaptability across in-distribution, spatial out-of-distribution, and instance-plus-spatial out-of-distribution settings, and (3) language instruction-following…
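
As a rough illustration of the first two dimensions named in the abstract, the sketch below aggregates per-trial logs into a success rate and a mean time-to-success for each (model, setting) pair. It is not the authors' framework: the `Trial` record, the `summarize` helper, the setting labels, and the sample data are all hypothetical and chosen only to show the arithmetic.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional, Tuple


@dataclass
class Trial:
    """One evaluation rollout (hypothetical record layout, not from the paper)."""
    model: str
    setting: str                     # e.g. "in-distribution", "spatial-ood"
    success: bool
    seconds: Optional[float] = None  # time-to-success; None for failed trials


def summarize(trials: List[Trial]) -> Dict[Tuple[str, str], dict]:
    """Group trials by (model, setting) and compute the two metrics."""
    groups: Dict[Tuple[str, str], List[Trial]] = {}
    for t in trials:
        groups.setdefault((t.model, t.setting), []).append(t)

    summary = {}
    for key, ts in groups.items():
        # Time-to-success is averaged over successful trials only.
        times = [t.seconds for t in ts if t.success and t.seconds is not None]
        summary[key] = {
            "success_rate": sum(t.success for t in ts) / len(ts),
            "mean_time_to_success": mean(times) if times else None,
        }
    return summary


# Made-up trials for a placeholder model; these are not results from the paper.
trials = [
    Trial("model_a", "in-distribution", True, 12.0),
    Trial("model_a", "in-distribution", True, 14.0),
    Trial("model_a", "spatial-ood", False),
    Trial("model_a", "spatial-ood", True, 20.0),
]
result = summarize(trials)
```

Keeping failed trials out of the time average is one possible convention; a framework could instead assign failures a timeout value, which would change the reported numbers.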

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics