Motus: A Unified Latent Action World Model

Hongzhe Bi; Hengkai Tan; Shenghao Xie; Zeyuan Wang; Shuhe Huang; Haitian Liu; Ruowen Zhao; Yao Feng; Chendong Xiang; Yinze Rong; Hongyan Zhao; Hanyu Liu; Zhizhong Su; Lei Ma; Hang Su; Jun Zhu

arXiv:2512.13030 · cs.CV · December 29, 2025

TL;DR

Motus is a unified latent action world model that integrates understanding, generation, and control using a Mixture-of-Transformer architecture, enabling flexible multimodal modeling and large-scale pretraining for robotic tasks.

Contribution

Motus introduces a Mixture-of-Transformer (MoT) architecture and a training pipeline for unified multimodal world modeling in embodied agents.

Findings

01. Achieves +15% to +45% performance improvements over state-of-the-art methods.

02. Demonstrates effective large-scale pretraining of latent actions.

03. Improves downstream robotic task performance in simulation and real-world scenarios.

Abstract

While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture-of-Transformer (MoT) architecture to integrate three experts (i.e., understanding, video generation, and action) and adopts a UniDiffuser-style scheduler to enable flexible switching between different modeling modes (i.e., world models, vision-language-action models, inverse dynamics models, video generation models, and video-action joint prediction models). Motus further leverages optical flow to learn latent…
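The mode switching mentioned in the abstract can be illustrated with a small sketch. In a UniDiffuser-style scheduler, each modality gets its own diffusion timestep: a timestep of 0 means the modality is given clean (conditioned on), the maximum timestep means it is pure noise (marginalized out), and an intermediate value means it is being denoised. The mapping below from the five listed modes to (video, action) timesteps is an assumed reading of that idea, not the paper's confirmed implementation; the names `T`, `Timesteps`, and `mode_timesteps` are hypothetical.

```python
from dataclasses import dataclass

# Assumed number of diffusion steps: T = pure noise, 0 = clean signal.
T = 1000


@dataclass(frozen=True)
class Timesteps:
    """Per-modality noise levels passed to a joint denoiser (hypothetical)."""
    video: int
    action: int


def mode_timesteps(mode: str, t: int) -> Timesteps:
    """Return the (video, action) timesteps for one denoising step at level t.

    Assumed convention, following the UniDiffuser idea:
      - timestep 0  -> modality is observed (used as a condition)
      - timestep T  -> modality is pure noise (effectively ignored)
      - timestep t  -> modality is currently being generated
    """
    if not 0 <= t <= T:
        raise ValueError(f"t must be in [0, {T}], got {t}")
    table = {
        # Predict future video given known actions.
        "world_model": Timesteps(video=t, action=0),
        # Predict actions given known video.
        "inverse_dynamics": Timesteps(video=0, action=t),
        # Predict actions without access to future video.
        "vla": Timesteps(video=T, action=t),
        # Predict video without access to actions.
        "video_generation": Timesteps(video=t, action=T),
        # Predict video and actions together.
        "joint": Timesteps(video=t, action=t),
    }
    if mode not in table:
        raise ValueError(f"unknown mode: {mode!r}")
    return table[mode]


if __name__ == "__main__":
    for name in ("world_model", "inverse_dynamics", "vla",
                 "video_generation", "joint"):
        print(name, mode_timesteps(name, 500))
```

Under this reading, a single network trained with independently sampled timesteps per modality could serve all five modes at inference time simply by choosing which timestep to pin to 0 or `T`.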

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Motion and Animation