What Matters in Building Vision-Language-Action Models for Generalist Robots
Xinghang Li, Peiyan Li, Long Qian, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Xinlong Wang, Di Guo, Tao Kong, Hanbo Zhang, Huaping Liu

TL;DR
This paper investigates key design factors for Vision-Language-Action models in robotics, introduces RoboVLMs with minimal manual design, and provides extensive experimental insights and an open-source framework for future research.
Contribution
It identifies crucial design choices for VLAs, develops a new flexible family of RoboVLMs achieving state-of-the-art results, and offers comprehensive experimental guidance and open-source tools.
Findings
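VLA performance depends on three design choices: which VLM backbone is selected, how the VLA architecture is formulated, and when cross-embodiment data is added.
RoboVLMs achieve new state-of-the-art performance in three simulation tasks and in real-world experiments.
Extensive experiments covering over 8 VLM backbones, 4 policy architectures, and over 600 distinct experimental designs provide detailed design insights.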
Abstract
To utilize Foundation Vision Language Models (VLMs) for robotic tasks and motion planning, the community has proposed different methods for injecting action components into VLMs and building the Vision-Language-Action models (VLAs). In this work, we disclose the key factors that significantly influence the performance of VLA on robot manipulation problems and focus on answering three essential design choices: which backbone to select, how to formulate the VLA architectures, and when to add cross-embodiment data. The obtained results convince us firmly to explain why we prefer VLA and develop a new family of VLAs, RoboVLMs, which require very few manual designs and achieve a new state-of-the-art performance in three simulation tasks and real-world experiments. Through our extensive experiments, which include over 8 VLM backbones, 4 policy architectures, and over 600 distinct designed…
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Focus
