Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization
Ye Wang, Sipeng Zheng, Hao Luo, Wanpeng Zhang, Haoqi Yuan, Chaoyi Xu, Haiweng Xu, Yicheng Feng, Mingyang Yu, Zhiyu Kang, Zongqing Lu, Qin Jin

TL;DR
This paper systematically investigates how scaling Vision-Language-Action models affects robotic control, revealing critical factors like physical alignment and data heterogeneity, and providing practical training insights.
Contribution
It offers a comprehensive analysis of VLA model scaling in robotics, highlighting the importance of physical alignment, the risks of data pooling, and the limited impact of regularization strategies.
Findings
Unified end-effector relative action representation improves cross-embodiment transfer.
Naive pooling of heterogeneous robot data can cause negative transfer.
Regularization strategies like sensory dropout do not consistently enhance performance.
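To make the first finding concrete, here is a minimal sketch of one common way to express an action as a relative end-effector motion. The summary does not give the paper's exact definition, so the convention used here (the next pose expressed in the current end-effector frame) and the helper names `make_pose` and `relative_eef_action` are assumptions for illustration only.

```python
import numpy as np


def make_pose(yaw, xyz):
    """Build a 4x4 homogeneous pose from a yaw angle (rad) and a position.

    Hypothetical helper: real robots would supply full 3D orientations.
    """
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = xyz
    return T


def relative_eef_action(T_curr, T_next):
    """Return the next end-effector pose expressed in the current end-effector frame.

    Assumed convention: delta = inv(T_curr) @ T_next.
    """
    return np.linalg.inv(T_curr) @ T_next


# The same physical motion recorded by two robots whose base frames differ.
T_curr_a = make_pose(0.3, [0.4, 0.0, 0.2])
T_next_a = make_pose(0.4, [0.45, 0.02, 0.2])
base_offset = make_pose(1.2, [1.0, -0.5, 0.1])  # robot B's base relative to robot A's
T_curr_b = base_offset @ T_curr_a
T_next_b = base_offset @ T_next_a

delta_a = relative_eef_action(T_curr_a, T_next_a)
delta_b = relative_eef_action(T_curr_b, T_next_b)
```

Under this convention the base-frame offset cancels, so `delta_a` and `delta_b` agree. That invariance is one plausible reason a shared relative representation could help transfer across embodiments, though the summary itself does not state the mechanism.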
Abstract
While Vision-Language-Action (VLA) models show strong promise for generalist robot control, it remains unclear whether -- and under what conditions -- the standard "scale data" recipe translates to robotics, where training data is inherently heterogeneous across embodiments, sensors, and action spaces. We present a systematic, controlled study of VLA scaling that revisits core training choices for pretraining across diverse robots. Using a representative VLA framework that combines a vision-language backbone with flow-matching, we ablate key design decisions under matched conditions and evaluate in extensive simulation and real-robot experiments. To improve the reliability of real-world results, we introduce a Grouped Blind Ensemble protocol that blinds operators to model identity and separates policy execution from outcome judgment, reducing experimenter bias. Our analysis targets…
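The blinding step described in the abstract can be illustrated with a short sketch. It covers only the idea of hiding model identity from operators; how the paper groups trials or separates execution from judgment is not specified in this summary, so the function `blinded_schedule`, the anonymous code names, and the fixed seed are all hypothetical.

```python
import random


def blinded_schedule(model_names, trials_per_model, seed=0):
    """Assign anonymous codes to models and return a shuffled trial order.

    Returns (schedule, key): operators would see only `schedule`, while `key`
    (code -> real model name) stays sealed until outcomes are recorded.
    """
    rng = random.Random(seed)
    codes = [f"policy_{i}" for i in range(len(model_names))]
    shuffled_names = list(model_names)
    rng.shuffle(shuffled_names)  # break any link between code index and model
    key = dict(zip(codes, shuffled_names))
    schedule = [code for code in codes for _ in range(trials_per_model)]
    rng.shuffle(schedule)  # interleave trials so ordering does not reveal identity
    return schedule, key


schedule, key = blinded_schedule(["model_a", "model_b", "model_c"], trials_per_model=4)
```

In a setup like this, a second person who never sees `key` would record success or failure per trial, and the codes would be resolved to model names only after all outcomes are logged.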
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Robot Manipulation and Learning
