VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models
Chongkai Gao, Zixuan Liu, Zhenghao Chi, Junshan Huang, Xin Fei, Yiwen Hou, Yuxuan Zhang, Yudi Lin, Zhirui Fang, Zeyu Jiang, Lin Shao

TL;DR
This paper introduces VLA-OS, a unified architecture series for Vision-Language-Action (VLA) models, and uses it to analyze systematically how different planning paradigms and representations affect performance across diverse tasks and environments.
Contribution
The paper presents VLA-OS, a comprehensive framework enabling controlled comparison of planning paradigms and representations in VLA models, isolating their effects from architecture and data influences.
Findings
Visually grounded planning representations outperform language-based ones.
The Hierarchical-VLA paradigm shows superior or comparable performance relative to the other paradigms across multiple metrics.
Hierarchical-VLA offers better generalization and scalability despite slower training and inference.
Abstract
Recent studies on Vision-Language-Action (VLA) models have shifted from the end-to-end action-generation paradigm toward a pipeline involving task planning followed by action generation, demonstrating improved performance on various complex, long-horizon manipulation tasks. However, existing approaches vary significantly in network architectures, planning paradigms, representations, and training data sources, making it challenging for researchers to identify the precise sources of performance gains and the components that need further improvement. To systematically investigate the impacts of different planning paradigms and representations in isolation from network architectures and training data, we introduce VLA-OS, a unified VLA architecture series capable of various task planning paradigms, and design a comprehensive suite of controlled experiments across diverse object…
