Chain of World: World Model Thinking in Latent Motion

Fuxiang Yang; Donglin Di; Lulu Tang; Xuancheng Zhang; Lei Fan; Hao Li; Chen Wei; Tonghua Su; Baorui Ma

arXiv:2603.03195 · cs.CV · March 4, 2026

TL;DR

CoWVLA introduces a unified latent motion and world-model framework for vision-language-action tasks, enhancing temporal reasoning and efficiency in embodied intelligence applications.

Contribution

The paper proposes a Chain-of-World paradigm that combines world-model reasoning with disentangled latent motion representations for improved visuomotor learning.

Findings

01. Outperforms existing world-model and latent-action methods on robotic benchmarks.

02. Achieves efficient and interpretable visuomotor pretraining.

03. Demonstrates the effectiveness of latent motion chains in continuous dynamic modeling.

Abstract

Vision-Language-Action (VLA) models are a promising path toward embodied intelligence, yet they often overlook the predictive and temporal-causal structure underlying visual dynamics. World-model VLAs address this by predicting future frames, but waste capacity reconstructing redundant backgrounds. Latent-action VLAs encode frame-to-frame transitions compactly, but lack temporally continuous dynamic modeling and world knowledge. To overcome these limitations, we introduce CoWVLA (Chain-of-World VLA), a new "Chain of World" paradigm that unifies world-model temporal reasoning with a disentangled latent motion representation. First, a pretrained video VAE serves as a latent motion extractor, explicitly factorizing video segments into structure and motion latents. Then, during pre-training, the VLA learns from an instruction and an initial frame to infer a continuous latent motion chain…
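The structure/motion factorization described in the abstract can be illustrated with a deliberately simplified sketch. This is not the paper's video VAE: it assumes each frame is already a small feature vector, treats the first frame as the "structure" latent, and treats frame-to-frame differences as the "motion" chain. The names `factorize` and `rollout` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Vector = List[float]


def factorize(frames: List[Vector]) -> Tuple[Vector, List[Vector]]:
    """Toy split (assumption, not the paper's VAE):
    structure = first frame, motion = frame-to-frame differences."""
    if not frames:
        raise ValueError("need at least one frame")
    structure = list(frames[0])
    motion = [
        [b - a for a, b in zip(prev, cur)]
        for prev, cur in zip(frames, frames[1:])
    ]
    return structure, motion


def rollout(structure: Vector, motion: List[Vector]) -> List[Vector]:
    """Rebuild the frame sequence by accumulating each motion step
    onto the structure latent."""
    frames = [list(structure)]
    for step in motion:
        frames.append([x + d for x, d in zip(frames[-1], step)])
    return frames


# A static second feature (background) and a moving first feature.
frames = [[0.0, 1.0], [0.5, 1.0], [1.5, 1.0]]
structure, motion = factorize(frames)
reconstructed = rollout(structure, motion)
```

In this toy setting the static feature contributes nothing to the motion chain, which mirrors the abstract's point that a motion representation need not spend capacity on redundant backgrounds.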

Code & Models

Models

🤗 hitfx/CoWVLA (model)

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition