VLATest: Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation
Zhijie Wang, Zhehua Zhou, Jiayang Song, Yuheng Huang, Zhan Shu, Lei Ma

TL;DR
This paper introduces VLATest, a fuzzing framework for testing vision-language-action models in robotic manipulation, revealing their current lack of robustness across diverse scenarios and conditions.
Contribution
We developed VLATest to generate diverse testing scenes and conducted an empirical study on seven VLA models, exposing their robustness limitations.
Findings
VLA models lack robustness in diverse scenarios
Performance drops with confounding objects and lighting changes
Unseen objects and instruction mutations significantly affect accuracy
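The perturbation axes named in these findings (confounding objects, lighting, unseen objects, instruction wording) can be pictured as parameters of a random scene sampler. The following is a minimal sketch of that idea only, not VLATest's actual implementation: the object lists, instruction templates, lighting range, and every function and field name are hypothetical and chosen purely for illustration.

```python
import random
from dataclasses import dataclass

# Hypothetical object pools and instruction templates (not from the paper).
SEEN_OBJECTS = ["coke can", "apple", "sponge"]
UNSEEN_OBJECTS = ["toy dinosaur", "stapler"]
INSTRUCTION_TEMPLATES = [
    "pick up the {obj}",
    "grasp the {obj}",
    "lift the {obj} off the table",
]


@dataclass
class Scene:
    target: str            # object the instruction refers to
    confounders: list      # distractor objects placed in the scene
    lighting_scale: float  # multiplier on a nominal light intensity
    instruction: str       # natural-language task prompt


def sample_scene(rng, max_confounders=3, allow_unseen=False,
                 mutate_instruction=False):
    """Draw one random scene along the four perturbation axes."""
    pool = SEEN_OBJECTS + (UNSEEN_OBJECTS if allow_unseen else [])
    target = rng.choice(pool)
    # Confounders are drawn from the remaining objects, so the target
    # never appears twice in the same scene.
    others = [o for o in pool if o != target]
    k = rng.randint(0, min(max_confounders, len(others)))
    confounders = rng.sample(others, k)
    # Assumed lighting range: 0.5x to 1.5x of the nominal intensity.
    lighting_scale = rng.uniform(0.5, 1.5)
    if mutate_instruction:
        template = rng.choice(INSTRUCTION_TEMPLATES)
    else:
        template = INSTRUCTION_TEMPLATES[0]
    return Scene(target, confounders, lighting_scale,
                 template.format(obj=target))


def generate_scenes(n, seed=0, **kwargs):
    """Generate n scenes reproducibly from a fixed seed."""
    rng = random.Random(seed)
    return [sample_scene(rng, **kwargs) for _ in range(n)]
```

Seeding the generator keeps a failing scene reproducible, which matters when a model's failure has to be replayed in a simulator. A call such as `generate_scenes(100, seed=1, allow_unseen=True, mutate_instruction=True)` would yield a batch that varies all four axes at once.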
Abstract
The rapid advancement of generative AI and multi-modal foundation models has shown significant potential in advancing robotic manipulation. Vision-language-action (VLA) models, in particular, have emerged as a promising approach for visuomotor control by leveraging large-scale vision-language data and robot demonstrations. However, current VLA models are typically evaluated using a limited set of hand-crafted scenes, leaving their general performance and robustness in diverse scenarios largely unexplored. To address this gap, we present VLATest, a fuzzing framework designed to generate robotic manipulation scenes for testing VLA models. Based on VLATest, we conducted an empirical study to assess the performance of seven representative VLA models. Our study results revealed that current VLA models lack the robustness necessary for practical deployment. Additionally, we investigated the…
Taxonomy
Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Robotics and Automated Systems
