VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models

Chongkai Gao; Zixuan Liu; Zhenghao Chi; Junshan Huang; Xin Fei; Yiwen Hou; Yuxuan Zhang; Yudi Lin; Zhirui Fang; Zeyu Jiang; Lin Shao

arXiv:2506.17561·cs.CV·June 24, 2025



TL;DR

This paper introduces VLA-OS, a unified architecture for Vision-Language-Action models, systematically analyzing how different planning paradigms and representations affect performance across diverse tasks and environments.

Contribution

The paper presents VLA-OS, a comprehensive framework enabling controlled comparison of planning paradigms and representations in VLA models, isolating their effects from architecture and data influences.

Findings

01. Visually grounded planning representations outperform language-based ones.

02. The Hierarchical-VLA paradigm shows superior or comparable performance across multiple metrics.

03. Hierarchical-VLA offers better generalization and scalability despite slower training and inference.

Abstract

Recent studies on Vision-Language-Action (VLA) models have shifted from the end-to-end action-generation paradigm toward a pipeline involving task planning followed by action generation, demonstrating improved performance on various complex, long-horizon manipulation tasks. However, existing approaches vary significantly in network architectures, planning paradigms, representations, and training data sources, making it challenging for researchers to identify the precise sources of performance gains and the components that need further improvement. To systematically investigate the impacts of different planning paradigms and representations in isolation from network architectures and training data, we introduce VLA-OS, a unified VLA architecture series capable of various task planning paradigms, and design a comprehensive suite of controlled experiments across diverse object…


Code & Models

Models

Linslab/VLA-OS
model· ♡ 1

Datasets

Linslab/VLA-OS-Dataset
dataset· 947 dl

Videos

VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models· slideslive