DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks

Yueci Deng; Guiliang Liu; Kui Jia

arXiv:2604.16484 · cs.CV · April 21, 2026

TL;DR

This paper introduces CLWM, a causal latent world model that improves robustness, reduces memory and latency, and enables scalable, zero-shot sim-to-real transfer for embodied manipulation tasks.

Contribution

The paper proposes CLWM, which combines a Dual-State TTT Memory with Speculative Asynchronous Inference (SAI), and the EmbodiChain framework, advancing robust, efficient, and scalable embodied task learning.

Findings

01 CLWM achieves state-of-the-art results in dual-arm simulation.

02 Unprecedented zero-shot sim-to-real transfer on robots.

03 SAI reduces blocking latency by about 50%.

Abstract

Deploying generative World-Action Models for manipulation is severely bottlenecked by redundant pixel-level reconstruction, $O(T)$ memory scaling, and sequential inference latency. We introduce the Causal Latent World Model (CLWM), which employs DINOv3 features as generative targets to disentangle interaction semantics from visual noise, yielding highly robust domain generalization. To overcome memory scaling, CLWM features a Dual-State Test-Time Training (TTT) Memory that guarantees a strict $O(1)$ footprint for long-horizon tasks. To overcome deployment latency, we propose Speculative Asynchronous Inference (SAI) to mask partial diffusion denoising behind physical execution, cutting blocking latency by about $50\%$. To scale robust policies, we present EmbodiChain, an online framework that establishes the Efficiency Law by injecting an infinite flow of…
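
The contrast between $O(T)$ and $O(1)$ memory in the abstract can be illustrated with a generic test-time-training style memory. The sketch below is not the paper's Dual-State design, whose details are not given here. It assumes a single linear fast-weight matrix that is updated by one gradient step per time step, and compares its fixed size with a cache that stores every step. The names `LinearTTTMemory` and `compare_footprints` are hypothetical.

```python
class LinearTTTMemory:
    """Fixed-size fast-weight memory (illustrative; not the paper's Dual-State memory)."""

    def __init__(self, dim, lr=0.5):
        self.dim = dim
        self.lr = lr
        # The dim x dim matrix W is the only state carried across time steps.
        self.W = [[0.0] * dim for _ in range(dim)]

    def read(self, query):
        # Retrieve by computing W @ query.
        return [sum(w * q for w, q in zip(row, query)) for row in self.W]

    def update(self, key, value):
        # One gradient step on 0.5 * ||W @ key - value||^2 with respect to W.
        pred = self.read(key)
        for i in range(self.dim):
            err = pred[i] - value[i]
            for j in range(self.dim):
                self.W[i][j] -= self.lr * err * key[j]

    def state_size(self):
        # Number of floats stored, independent of how many steps were seen.
        return self.dim * self.dim


def compare_footprints(num_steps, dim=4):
    """Return (floats held by the TTT memory, floats held by a per-step cache)."""
    memory = LinearTTTMemory(dim)
    cache = []  # baseline that keeps every (key, value) pair, so it grows with T
    for t in range(num_steps):
        key = [1.0 if j == t % dim else 0.0 for j in range(dim)]
        value = [float(t)] * dim
        memory.update(key, value)
        cache.append((key, value))
    cache_floats = sum(len(k) + len(v) for k, v in cache)
    return memory.state_size(), cache_floats


if __name__ == "__main__":
    for steps in (10, 1000):
        print(steps, compare_footprints(steps))
```

In this toy setting the fast-weight state stays at `dim * dim` floats however long the episode runs, while the cache grows linearly with the number of steps. The trade-off is that older associations are compressed into the weights and may be overwritten by later updates.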
