Experiences from Benchmarking Vision-Language-Action Models for Robotic Manipulation
Yihao Zhang, Yuankai Qi, Xi Zheng

TL;DR
This paper empirically benchmarks four vision-language-action models for robotic manipulation, evaluating their performance, adaptability, and failure modes across simulation and real-world platforms to inform deployment trade-offs.
Contribution
It introduces a standardized evaluation framework and provides comparative insights into model performance, adaptability, and computational demands in real-world robotic manipulation.
Findings
$\pi_0$ shows superior out-of-distribution adaptability
ACT offers high in-distribution stability
Identifies common failure modes like near-miss grasps
Abstract
Foundation models applied in robotics, particularly Vision-Language-Action (VLA) models, hold great promise for achieving general-purpose manipulation. Yet systematic real-world evaluations and cross-model comparisons remain scarce. This paper reports our empirical experiences from benchmarking four representative VLAs (ACT, OpenVLA-OFT, RDT-1B, and $\pi_0$) across four manipulation tasks conducted both in simulation and on the ALOHA Mobile platform. We establish a standardized evaluation framework that measures performance along three key dimensions: (1) accuracy and efficiency (success rate and time-to-success), (2) adaptability across in-distribution, spatial out-of-distribution, and instance-plus-spatial out-of-distribution settings, and (3) language instruction-following…
Taxonomy
Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics
