LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation

Yuhang Huang; Jiazhao Zhang; Shilong Zou; Xinwang Liu; Ruizhen Hu; Kai Xu

arXiv:2505.11528 · cs.RO · September 15, 2025

TL;DR

LaDi-WM introduces a diffusion-based world model predicting future states in a latent space aligned with pre-trained visual models, significantly improving robot policy performance and generalizability in predictive manipulation tasks.

Contribution

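The paper presents LaDi-WM, a latent diffusion-based world model that predicts future states in a latent space aligned with pre-trained Visual Foundation Models rather than in pixel space. The authors find this latent prediction easier to learn and more generalizable than pixel-level image prediction.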

Findings

01. Improves policy performance by 27.9% on the LIBERO-LONG benchmark.

02. Achieves a 20% improvement in real-world scenarios.

03. Demonstrates strong generalizability in experiments.

Abstract

Predictive manipulation has recently gained considerable attention in the Embodied AI community due to its potential to improve robot policy performance by leveraging predicted states. However, generating accurate future visual states of robot-object interactions from world models remains a well-known challenge, particularly in achieving high-quality pixel-level representations. To this end, we propose LaDi-WM, a world model that predicts the latent space of future states using diffusion modeling. Specifically, LaDi-WM leverages the well-established latent space aligned with pre-trained Visual Foundation Models (VFMs), which comprises both geometric features (DINO-based) and semantic features (CLIP-based). We find that predicting the evolution of the latent space is easier to learn and more generalizable than directly predicting pixel-level images. Building on LaDi-WM, we design a…
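The idea of applying diffusion to latent features rather than pixels can be illustrated with a minimal sketch. It shows only the generic DDPM-style forward noising step and the noise-prediction loss, applied to stand-in feature arrays. Everything specific here is an assumption for illustration and not taken from the paper: the function names, the linear noise schedule, the feature shapes, the concatenation of the two feature sets, and the use of random arrays in place of real DINO-based and CLIP-based features.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention terms for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def noise_latent(z0, t, alpha_bar, rng):
    """Sample z_t ~ q(z_t | z_0) = N(sqrt(a_t) * z_0, (1 - a_t) * I)."""
    eps = rng.standard_normal(z0.shape)
    a = alpha_bar[t]
    zt = np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps
    return zt, eps

def epsilon_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))

rng = np.random.default_rng(0)

# Random stand-ins for per-token features; shapes are hypothetical.
geometric = rng.standard_normal((16, 384))  # placeholder for DINO-based features
semantic = rng.standard_normal((16, 512))   # placeholder for CLIP-based features

# Assumption for illustration only: the source does not say how the two
# feature sets are combined, so they are simply concatenated here.
z0 = np.concatenate([geometric, semantic], axis=-1)

alpha_bar = make_alpha_bar()
t = 500
zt, eps = noise_latent(z0, t, alpha_bar, rng)

# A trained denoiser would predict eps from (zt, t, past states, actions);
# an all-zeros prediction is used here purely to exercise the loss.
loss = epsilon_loss(np.zeros_like(eps), eps)
print(zt.shape, loss)
```

In a world-model setting, the denoiser would additionally be conditioned on past latent states and robot actions so that the sampled latent corresponds to a future state; that conditioning is omitted from this sketch.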

Taxonomy

Topics: Reinforcement Learning in Robotics · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications

Methods: Softmax · Attention Is All You Need · Diffusion