Do World Action Models Generalize Better than VLAs? A Robustness Study
Zhanguang Zhang, Zhiyuan Li, Behnam Rahmati, Rui Heng Yang, Yintao Ma, Amir Rasouli, Sajjad Pakdamansavoji, Yangzheng Wu, Lingfeng Zhang, Tongtong Cao, Feng Wen, Xinyu Wang, Xingyue Quan, Yingxue Zhang

TL;DR
This study compares world action models (WAMs) and vision-language-action (VLA) policies on robotic tasks, showing that WAMs are more robust and generalize better, especially under visual and language perturbations.
Contribution
It provides a comprehensive comparison of WAMs and VLAs, highlighting WAMs' robustness advantages and the impact of hybrid approaches in robotic action planning.
Findings
WAMs achieve up to 82.2% success rate on LIBERO-Plus.
WAMs demonstrate strong robustness under various perturbations.
VLAs require extensive training to match WAMs' robustness.
Abstract
Robot action planning in the real world is challenging as it requires not only understanding the current state of the environment but also predicting how it will evolve in response to actions. Vision-language-action (VLA) models, which repurpose large-scale vision-language models for robot action generation using action experts, have achieved notable success across a variety of robotic tasks. Nevertheless, their performance remains constrained by the scope of their training data: they generalize poorly to unseen scenarios and are vulnerable to diverse contextual perturbations. More recently, world models have been revisited as an alternative to VLAs. These models, referred to as world action models (WAMs), are built upon world models that are trained on large corpora of video data to predict future states. With minor adaptations, their latent representation can be decoded into robot…
