Chain of World: World Model Thinking in Latent Motion
Fuxiang Yang, Donglin Di, Lulu Tang, Xuancheng Zhang, Lei Fan, Hao Li, Chen Wei, Tonghua Su, Baorui Ma

TL;DR
CoWVLA introduces a unified latent motion and world-model framework for vision-language-action tasks, enhancing temporal reasoning and efficiency in embodied intelligence applications.
Contribution
It proposes a novel Chain-of-World paradigm that combines world-model reasoning with disentangled latent motion representations for improved visuomotor learning.
Findings
Outperforms existing world-model and latent-action methods on robotic benchmarks.
Achieves efficient and interpretable visuomotor pretraining.
Demonstrates the effectiveness of latent motion chains in continuous dynamic modeling.
Abstract
Vision-Language-Action (VLA) models are a promising path toward embodied intelligence, yet they often overlook the predictive and temporal-causal structure underlying visual dynamics. World-model VLAs address this by predicting future frames, but waste capacity reconstructing redundant backgrounds. Latent-action VLAs encode frame-to-frame transitions compactly, but lack temporally continuous dynamic modeling and world knowledge. To overcome these limitations, we introduce CoWVLA (Chain-of-World VLA), a new "Chain of World" paradigm that unifies world-model temporal reasoning with a disentangled latent motion representation. First, a pretrained video VAE serves as a latent motion extractor, explicitly factorizing video segments into structure and motion latents. Then, during pre-training, the VLA learns from an instruction and an initial frame to infer a continuous latent motion chain…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
