MotuBrain: An Advanced World Action Model for Robot Control

MotuBrain Team; Chendong Xiang; Fan Bao; Haitian Liu; Hengkai Tan; Hongzhe Bi; James Li; Jiabao Liu; Jingrui Pang; Kiro Jing; Louis Liu; Mengchen Cai; Rongxu Cui; Ruowen Zhao; Runqing Wang; Shuhe Huang; Yao Feng; Yinze Rong; Zeyuan Wang; Jun Zhu

arXiv:2604.27792 · cs.RO · May 4, 2026

TL;DR

MotuBrain is a unified world action model for robot control that jointly models video and action. It reports over 50x faster inference, 95.8% success on RoboTwin 2.0, and adaptation to new humanoid robots from limited data.

Contribution

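MotuBrain introduces a three-stream Mixture-of-Transformers architecture under a UniDiffuser formulation, adding unified multiview modeling, an independent text stream for stronger language-action coupling, and a shared cross-embodiment action representation.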

Findings

01

Achieves over 50x inference speedup.

02

Attains 95.8% success on RoboTwin 2.0 tasks.

03

Adapts to new humanoid robots with limited data.

Abstract

Vision-Language-Action (VLA) models generalize well semantically but often lack fine-grained modeling of world dynamics. We present MotuBrain, a unified World Action Model that jointly models video and action under a UniDiffuser formulation with a three-stream Mixture-of-Transformers architecture. A single model supports policy learning, world modeling, video generation, inverse dynamics, and joint video-action prediction, while scaling to heterogeneous multimodal data such as video-only, task-agnostic, and cross-embodiment robot data. Building on Motus, MotuBrain further introduces unified multiview modeling, an independent text stream for stronger language-action coupling, a shared cross-embodiment action representation, and an efficient post-training and deployment recipe for long-horizon real-world control. Our inference stack combines step reduction, compilation, FP8 quantization,…
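To illustrate how one model can cover the five modes listed in the abstract, here is a minimal sketch of the general UniDiffuser idea: each modality gets its own noise timestep, and the choice of timesteps selects the distribution being sampled. The abstract does not spell out MotuBrain's exact scheme, so the mode names, the `timesteps_for_mode` helper, the `[0, 1]` time range, and the specific mode-to-timestep mapping (in particular for the policy mode) are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed convention: t = 1.0 is pure noise, t = 0.0 is clean data.
T_NOISE = 1.0
T_CLEAN = 0.0


@dataclass(frozen=True)
class Timesteps:
    """Per-modality noise levels fed to a joint video-action denoiser."""
    video: float
    action: float


def timesteps_for_mode(mode: str, t: float) -> Timesteps:
    """Map a task mode to (video, action) timesteps at sampling time t.

    UniDiffuser-style rule of thumb (hypothetical mapping for this sketch):
      - a modality being generated is denoised at the current time t,
      - a modality given as a condition is held clean (T_CLEAN),
      - a modality to be marginalised out is held at pure noise (T_NOISE).
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError(f"t must lie in [0, 1], got {t}")

    table = {
        # Generate actions, ignore future video (assumed policy behaviour).
        "policy": Timesteps(video=T_NOISE, action=t),
        # Generate future video conditioned on a clean action sequence.
        "world_model": Timesteps(video=t, action=T_CLEAN),
        # Generate video alone, marginalising over actions.
        "video_generation": Timesteps(video=t, action=T_NOISE),
        # Recover actions from a clean video.
        "inverse_dynamics": Timesteps(video=T_CLEAN, action=t),
        # Denoise both modalities together.
        "joint": Timesteps(video=t, action=t),
    }
    if mode not in table:
        raise KeyError(f"unknown mode: {mode!r}")
    return table[mode]


if __name__ == "__main__":
    for name in ("policy", "world_model", "video_generation",
                 "inverse_dynamics", "joint"):
        print(name, timesteps_for_mode(name, 0.5))
```

Under this convention a single set of weights serves every mode, and only the timestep pair passed to the denoiser changes between them. That is also why video-only data can be used in training: the missing action stream can simply be treated as fully noised.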
