Xiaomi-Robotics-0: An Open-Sourced Vision-Language-Action Model with Real-Time Execution
Rui Cai, Jun Guo, Xinze He, Piaopiao Jin, Jie Li, Bingxuan Lin, Futeng Liu, Wei Liu, Fei Ma, Kun Ma, Feng Qiu, Heng Qu, Yifei Su, Qiao Sun, Dong Wang, Donghao Wang, Yunhong Wang, Rujie Wu, Diyun Xiang, Yu Yang, Hangjun Ye, Yuan Zhang, Quanyun Zhou

TL;DR
Xiaomi-Robotics-0 is a vision-language-action (VLA) model designed for real-time robotic control. It achieves state-of-the-art results in simulation and strong results on real-world tasks through its training recipe and deployment strategy.
Contribution
The paper introduces Xiaomi-Robotics-0, a VLA model with a training recipe and deployment techniques for real-time execution on robots. Pre-training gives the model broad action-generation capabilities while avoiding catastrophic forgetting of the visual-semantic knowledge in the underlying pre-trained VLM.
Findings
Achieves state-of-the-art performance in simulation benchmarks.
Runs smoothly and efficiently on real robots with consumer-grade GPUs.
Demonstrates high success rates and throughput in real-robot tasks.
Abstract
In this report, we introduce Xiaomi-Robotics-0, an advanced vision-language-action (VLA) model optimized for high performance and fast and smooth real-time execution. The key to our method lies in a carefully designed training recipe and deployment strategy. Xiaomi-Robotics-0 is first pre-trained on large-scale cross-embodiment robot trajectories and vision-language data, endowing it with broad and generalizable action-generation capabilities while avoiding catastrophic forgetting of the visual-semantic knowledge of the underlying pre-trained VLM. During post-training, we propose several techniques for training the VLA model for asynchronous execution to address the inference latency during real-robot rollouts. During deployment, we carefully align the timesteps of consecutive predicted action chunks to ensure continuous and seamless real-time rollouts. We evaluate Xiaomi-Robotics-0…
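The abstract's point about aligning the timesteps of consecutive action chunks can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the authors' implementation: `CHUNK_LEN`, `LATENCY_STEPS`, `toy_policy`, `align_chunk`, and `simulate` are all hypothetical names, and the "policy" simply labels each action with the timestep it is meant for. Because inference takes several control steps, the first few actions of a freshly predicted chunk refer to timesteps that have already passed; dropping them keeps execution continuous.

```python
from typing import List

CHUNK_LEN = 8      # hypothetical: actions predicted per chunk
LATENCY_STEPS = 3  # hypothetical: control steps consumed by one inference call


def toy_policy(obs_step: int) -> List[int]:
    """Stand-in for a VLA: each action is labeled with its intended timestep."""
    return [obs_step + i for i in range(CHUNK_LEN)]


def align_chunk(chunk: List[int], obs_step: int, now_step: int) -> List[int]:
    """Drop the actions whose timesteps elapsed while inference was running."""
    elapsed = now_step - obs_step
    if not 0 <= elapsed < len(chunk):
        raise ValueError("inference latency must be shorter than the chunk")
    return list(chunk[elapsed:])


def simulate(total_steps: int, align: bool = True) -> List[int]:
    """Run a toy asynchronous rollout and return the executed action labels."""
    executed: List[int] = []
    queue = toy_policy(0)   # assumption: first chunk is ready before the rollout
    pending_obs_step = 0    # observation step of the inference currently running
    for step in range(total_steps):
        # The inference launched at pending_obs_step completes LATENCY_STEPS later.
        if step - pending_obs_step == LATENCY_STEPS:
            new_chunk = toy_policy(pending_obs_step)
            if align:
                new_chunk = align_chunk(new_chunk, pending_obs_step, step)
            queue = new_chunk
            pending_obs_step = step  # immediately launch the next inference
        executed.append(queue.pop(0))
    return executed


if __name__ == "__main__":
    print("aligned:  ", simulate(10, align=True))
    print("unaligned:", simulate(10, align=False))
```

With alignment, every executed action carries the label of the step at which it runs. Without it, the robot replays stale actions each time a new chunk arrives, which is the kind of discontinuity the deployment strategy is described as avoiding.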
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Advanced Neural Network Applications
