Rethinking the Practicality of Vision-language-action Model: A Comprehensive Benchmark and An Improved Baseline
Wenxuan Song, Jiayi Chen, Xiaoquan Sun, Huashuo Lei, Yikai Qin, Wei Zhao, Pengxiang Ding, Han Zhao, Tongxin Wang, Pengxu Hou, Zhide Zhong, Haodong Yan, Donglin Wang, Jun Ma, Haoang Li

TL;DR
This paper introduces CEBench, a comprehensive benchmark for vision-language-action models across diverse embodiments, and proposes LLaVA-VLA, a lightweight, practical VLA model that generalizes well and is suitable for real-world robotic applications.
Contribution
The paper presents a new benchmark, CEBench, and a lightweight VLA model, LLaVA-VLA, designed for practical deployment without extensive pre-training.
Findings
LLaVA-VLA demonstrates strong generalization across embodiments.
LLaVA-VLA achieves real-world mobile manipulation capabilities.
CEBench covers both simulation and real-world data, with 14.4k simulated trajectories and 1.6k real-world expert-curated trajectories.
Abstract
Vision-Language-Action (VLA) models have emerged as generalist robotic agents. However, existing VLAs are hindered by excessive parameter scales, prohibitive pre-training requirements, and limited applicability to diverse embodiments. To improve the practicality of VLAs, we propose a comprehensive benchmark and an improved baseline. First, we propose CEBench, a new benchmark that spans diverse embodiments in both simulation and the real world and accounts for domain randomization. We collect 14.4k simulated trajectories and 1.6k real-world expert-curated trajectories to support training on CEBench. Second, using CEBench as our testbed, we study three critical aspects of VLAs' practicality and offer several key findings. Informed by these findings, we introduce LLaVA-VLA, a lightweight yet powerful VLA designed for practical deployment on consumer-grade GPUs. Architecturally, it…
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Robot Manipulation and Learning
