Rethinking the Practicality of Vision-language-action Model: A Comprehensive Benchmark and An Improved Baseline

Wenxuan Song; Jiayi Chen; Xiaoquan Sun; Huashuo Lei; Yikai Qin; Wei Zhao; Pengxiang Ding; Han Zhao; Tongxin Wang; Pengxu Hou; Zhide Zhong; Haodong Yan; Donglin Wang; Jun Ma; Haoang Li

arXiv:2602.22663·cs.RO·February 27, 2026

Open Access

TL;DR

This paper introduces CEBench, a comprehensive benchmark for vision-language-action models across diverse embodiments, and proposes LLaVA-VLA, a lightweight, practical VLA model that generalizes well and is suitable for real-world robotic applications.

Contribution

The paper presents a new benchmark, CEBench, and a novel lightweight VLA model, LLaVA-VLA, designed for practical deployment without extensive pre-training.

Findings

01. LLaVA-VLA demonstrates strong generalization across embodiments.

02. LLaVA-VLA achieves real-world mobile manipulation capabilities.

03. CEBench supports both simulated and real-world data.

Abstract

Vision-Language-Action (VLA) models have emerged as generalist robotic agents. However, existing VLAs are hindered by excessive parameter scales, prohibitive pre-training requirements, and limited applicability to diverse embodiments. To improve the practicality of VLAs, we propose a comprehensive benchmark and an improved baseline. First, we propose CEBench, a new benchmark spanning diverse embodiments in both simulation and the real world with consideration of domain randomization. We collect 14.4k simulated trajectories and 1.6k real-world expert-curated trajectories to support training on CEBench. Second, using CEBench as our testbed, we study three critical aspects of VLAs' practicality and offer several key findings. Informed by these findings, we introduce LLaVA-VLA, a lightweight yet powerful VLA designed for practical deployment on consumer-grade GPUs. Architecturally, it…

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Robot Manipulation and Learning