Xiaomi-Robotics-0: An Open-Sourced Vision-Language-Action Model with Real-Time Execution

Rui Cai; Jun Guo; Xinze He; Piaopiao Jin; Jie Li; Bingxuan Lin; Futeng Liu; Wei Liu; Fei Ma; Kun Ma; Feng Qiu; Heng Qu; Yifei Su; Qiao Sun; Dong Wang; Donghao Wang; Yunhong Wang; Rujie Wu; Diyun Xiang; Yu Yang; Hangjun Ye; Yuan Zhang; Quanyun Zhou

arXiv:2602.12684 · cs.RO · March 26, 2026

TL;DR

Xiaomi-Robotics-0 is a novel vision-language-action model designed for real-time robotic control, achieving state-of-the-art results in simulation and real-world tasks through optimized training and deployment strategies.

Contribution

The paper introduces Xiaomi-Robotics-0, a new VLA model whose training recipe and deployment techniques enable real-time execution on robots while preserving the visual-semantic knowledge of the underlying pre-trained VLM.

Findings

01. Achieves state-of-the-art performance in simulation benchmarks.

02. Runs smoothly and efficiently on real robots with consumer-grade GPUs.

03. Demonstrates high success rates and throughput in real-robot tasks.

Abstract

In this report, we introduce Xiaomi-Robotics-0, an advanced vision-language-action (VLA) model optimized for high performance and fast and smooth real-time execution. The key to our method lies in a carefully designed training recipe and deployment strategy. Xiaomi-Robotics-0 is first pre-trained on large-scale cross-embodiment robot trajectories and vision-language data, endowing it with broad and generalizable action-generation capabilities while avoiding catastrophic forgetting of the visual-semantic knowledge of the underlying pre-trained VLM. During post-training, we propose several techniques for training the VLA model for asynchronous execution to address the inference latency during real-robot rollouts. During deployment, we carefully align the timesteps of consecutive predicted action chunks to ensure continuous and seamless real-time rollouts. We evaluate Xiaomi-Robotics-0…
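
The idea of aligning the timesteps of consecutive action chunks under inference latency can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `align_chunk`, `simulate`, and `toy_predict`, the fixed latency measured in control steps, and the fixed replanning interval are all hypothetical. Each toy action is simply the integer timestep it was planned for, so a correctly aligned rollout executes 0, 1, 2, ... with no repeats or gaps.

```python
from typing import Callable, List


def align_chunk(chunk: List[int], obs_step: int, now: int) -> List[int]:
    """Drop the actions whose intended timesteps passed while inference ran.

    chunk[i] is assumed to be the action planned for timestep obs_step + i.
    """
    skip = now - obs_step
    if skip < 0:
        raise ValueError("chunk was predicted from a future observation")
    return list(chunk[skip:])


def toy_predict(obs_step: int, horizon: int = 8) -> List[int]:
    """Hypothetical policy: each action is labelled with its target timestep."""
    return list(range(obs_step, obs_step + horizon))


def simulate(predict: Callable[[int], List[int]], total_steps: int,
             latency: int, replan_every: int) -> List[int]:
    """Execute actions while new chunks are computed in the background.

    Assumes the chunk horizon exceeds latency + replan_every, so the queue
    never runs dry while an inference is in flight.
    """
    executed: List[int] = []
    queue = predict(0)   # first chunk is computed before motion starts
    pending_obs = None   # timestep at which the in-flight inference started
    for t in range(total_steps):
        # An inference started at pending_obs finishes `latency` steps later.
        if pending_obs is not None and t - pending_obs == latency:
            queue = align_chunk(predict(pending_obs), pending_obs, t)
            pending_obs = None
        # Start a new background inference at the replanning interval.
        if pending_obs is None and t > 0 and t % replan_every == 0:
            pending_obs = t
        executed.append(queue.pop(0))
    return executed


if __name__ == "__main__":
    print(simulate(toy_predict, total_steps=12, latency=2, replan_every=4))
```

Without the `align_chunk` step, the robot would begin each new chunk at its first action, which was planned for a timestep that has already passed, and the executed sequence would jump backwards at every chunk boundary.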

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Advanced Neural Network Applications