What Matters in Building Vision-Language-Action Models for Generalist Robots

Xinghang Li; Peiyan Li; Long Qian; Minghuan Liu; Dong Wang; Jirong Liu; Bingyi Kang; Xiao Ma; Xinlong Wang; Di Guo; Tao Kong; Hanbo Zhang; Huaping Liu

arXiv:2412.14058 · cs.RO · February 16, 2026

TL;DR

This paper investigates key design factors for Vision-Language-Action models in robotics, introduces RoboVLMs with minimal manual design, and provides extensive experimental insights and an open-source framework for future research.

Contribution

It identifies crucial design choices for VLAs, develops a new flexible family of RoboVLMs achieving state-of-the-art results, and offers comprehensive experimental guidance and open-source tools.

Findings

01 VLA performance depends on the choice of backbone, architecture, and data integration.

02 RoboVLMs achieve new state-of-the-art results in simulation and real-world tasks.

03 Extensive experiments with over 8 backbones and 600 configurations provide detailed design insights.

Abstract

To utilize foundation Vision-Language Models (VLMs) for robotic tasks and motion planning, the community has proposed different methods for injecting action components into VLMs and building Vision-Language-Action models (VLAs). In this work, we disclose the key factors that significantly influence the performance of VLAs on robot manipulation problems and focus on three essential design choices: which backbone to select, how to formulate the VLA architectures, and when to add cross-embodiment data. The results explain why we prefer VLAs and lead us to develop a new family of VLAs, RoboVLMs, which require very little manual design and achieve new state-of-the-art performance in three simulation tasks and real-world experiments. Through our extensive experiments, which include over 8 VLM backbones, 4 policy architectures, and over 600 distinct designed…

Code & Models

Repositories

Robot-VLAs/RoboVLMs
pytorch

Datasets

robovlms/bytedance_robot_benchmark_20
dataset · 22 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Focus