DM0: An Embodied-Native Vision-Language-Action Model towards Physical AI

En Yu; Haoran Lv; Jianjian Sun; Kangheng Lin; Ruitao Zhang; Yukang Shi; Yuyang Chen; Ze Chen; Ziheng Zhang; Fan Jia; Kaixin Liu; Meng Zhang; Ruitao Hao; Saike Huang; Songhan Xie; Yu Liu; Zhao Wu; Bin Xie; Pengwei Zhang; Qi Yang; Xianchi Deng; Yunfei Wei; Enwen Zhang; Hongyang Peng; Jie Zhao; Kai Liu; Wei Sun; Yajun Wei; Yi Yang; Yunqiao Zhang; Ziwei Yan; Haitao Yang; Hao Liu; Haoqiang Fan; Haowei Zhang; Junwen Huang; Yang Chen; Yunchao Ma; Yunhuan Yang; Zhengyuan Du; Ziming Liu; Jiahui Niu; Yucheng Zhao; Daxin Jiang; Wenbin Tang; Xiangyu Zhang; Zheng Ge; Erjin Zhou; Tiancai Wang

arXiv:2602.14974·cs.RO·February 17, 2026

Open Access

TL;DR

DM0 introduces a unified vision-language-action framework for Physical AI, integrating diverse data sources and reasoning strategies to improve embodied manipulation and navigation tasks.

Contribution

It presents a comprehensive three-stage training pipeline and a novel embodied spatial scaffolding method for better physical grounding and reasoning.

Findings

01. Achieves state-of-the-art results on the RoboChallenge benchmark.

02. Effectively combines high-level reasoning with low-level control.

03. Demonstrates improved generalization in embodied tasks.

Abstract

Moving beyond the traditional paradigm of adapting internet-pretrained models to physical tasks, we present DM0, an Embodied-Native Vision-Language-Action (VLA) framework designed for Physical AI. Unlike approaches that treat physical grounding as a fine-tuning afterthought, DM0 unifies embodied manipulation and navigation by learning from heterogeneous data sources from the outset. Our methodology follows a comprehensive three-stage pipeline: Pretraining, Mid-Training, and Post-Training. First, we conduct large-scale unified pretraining on the Vision-Language Model (VLM) using diverse corpora (seamlessly integrating web text, autonomous driving scenarios, and embodied interaction logs) to jointly acquire semantic knowledge and physical priors. Subsequently, we build a flow-matching action expert atop the VLM. To reconcile high-level reasoning with low-level control, DM0 employs a hybrid…
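The abstract names a flow-matching action expert but, as excerpted here, does not give its formulation. As a rough illustration only, the sketch below shows one common variant of flow matching (a straight-line path from noise to the target action, with a mean-squared-error loss on velocity). The choice of path, the loss, and every function name are assumptions made for this example, not details taken from DM0.

```python
# Minimal flow-matching sketch in pure Python.
# ASSUMPTION: linear (straight-line) probability path and MSE velocity loss;
# the DM0 abstract does not specify these. All names here are hypothetical.

def interpolate(noise, action, t):
    """Point on the straight-line path from noise (t=0) to action (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]

def target_velocity(noise, action):
    """Velocity of the straight-line path; it is constant in t."""
    return [a - n for n, a in zip(noise, action)]

def flow_matching_loss(predicted_velocity, noise, action):
    """Mean squared error between a predicted velocity and the target velocity."""
    target = target_velocity(noise, action)
    return sum((p - v) ** 2 for p, v in zip(predicted_velocity, target)) / len(target)

def euler_sample(velocity_fn, noise, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

if __name__ == "__main__":
    noise = [0.5, -1.0, 0.25]    # stand-in for a Gaussian noise sample
    action = [0.1, 0.2, -0.3]    # stand-in for a ground-truth action vector
    # An oracle velocity field that already knows the target; a trained
    # network conditioned on observations would take its place.
    oracle = lambda x, t: target_velocity(noise, action)
    print(euler_sample(oracle, noise, steps=10))
```

In a learned setting, the oracle would be replaced by a network trained to minimize `flow_matching_loss` at randomly sampled times `t`, and `euler_sample` would then map fresh noise to an action at inference time.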

Taxonomy

Topics: Multimodal Machine Learning Applications · Action Observation and Synchronization · Machine Learning in Healthcare