LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation
Yuhang Huang, Jiazhao Zhang, Shilong Zou, Xinwang Liu, Ruizhen Hu, Kai Xu

TL;DR
LaDi-WM introduces a diffusion-based world model predicting future states in a latent space aligned with pre-trained visual models, significantly improving robot policy performance and generalizability in predictive manipulation tasks.
Contribution
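The paper presents LaDi-WM, a latent diffusion-based world model that predicts future states in a latent space aligned with pre-trained Visual Foundation Models. The authors find that predicting in this latent space is easier to learn and more generalizable than predicting pixel-level images.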
Findings
Improves policy performance by 27.9% on the LIBERO-LONG benchmark.
Achieves a 20% improvement in real-world scenarios.
Demonstrates strong generalizability in experiments.
Abstract
Predictive manipulation has recently gained considerable attention in the Embodied AI community due to its potential to improve robot policy performance by leveraging predicted states. However, generating accurate future visual states of robot-object interactions from world models remains a well-known challenge, particularly in achieving high-quality pixel-level representations. To this end, we propose LaDi-WM, a world model that predicts the latent space of future states using diffusion modeling. Specifically, LaDi-WM leverages the well-established latent space aligned with pre-trained Visual Foundation Models (VFMs), which comprises both geometric features (DINO-based) and semantic features (CLIP-based). We find that predicting the evolution of the latent space is easier to learn and more generalizable than directly predicting pixel-level images. Building on LaDi-WM, we design a…
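The idea of running diffusion over a feature latent rather than over pixels can be illustrated with a minimal sketch. This is not the authors' implementation: the feature arrays are random stand-ins for DINO-style and CLIP-style embeddings, the dimensions are hypothetical, combining the two by concatenation is an assumption (the abstract only says the latent space comprises both), and the noise schedule is the standard linear DDPM schedule rather than anything stated in the paper. All function names are illustrative.

```python
import numpy as np

def make_alpha_bar(num_steps=100, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear DDPM noise schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def build_latent(geometric_feat, semantic_feat):
    """Concatenate geometric and semantic features (assumed combination rule)."""
    return np.concatenate([geometric_feat, semantic_feat], axis=-1)

def add_noise(z0, t, alpha_bar, rng):
    """Standard DDPM forward process: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    eps = rng.standard_normal(z0.shape)
    a = alpha_bar[t]
    z_t = np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps
    return z_t, eps

def denoising_loss(predicted_eps, true_eps):
    """Mean squared error between predicted and true noise."""
    return float(np.mean((predicted_eps - true_eps) ** 2))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()

# Hypothetical future-state features: 4 tokens, 8 dimensions per feature type.
geo = rng.standard_normal((4, 8))   # stand-in for DINO-based geometric features
sem = rng.standard_normal((4, 8))   # stand-in for CLIP-based semantic features

z_future = build_latent(geo, sem)   # latent of the future state, shape (4, 16)
z_noisy, eps = add_noise(z_future, t=50, alpha_bar=alpha_bar, rng=rng)
```

In a trained world model, a network conditioned on the current latent and the robot action would predict `eps` from `z_noisy`, and `denoising_loss` would be its training objective; that network is omitted here because the source does not describe it.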
Taxonomy
Topics: Reinforcement Learning in Robotics · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
Methods: Softmax · Attention Is All You Need · Diffusion
