Latent Particle World Models: Self-supervised Object-centric Stochastic Dynamics Modeling
Tal Daniel, Carl Qi, Dan Haramati, Amir Zadeh, Chuan Li, Aviv Tamar, Deepak Pathak, David Held

TL;DR
The paper presents LPWM, a self-supervised, object-centric world model that learns scene decompositions and stochastic dynamics from videos without supervision and can be applied to decision-making.
Contribution
Introduces LPWM, a novel end-to-end trainable model that discovers scene structure and models stochastic dynamics directly from videos for real-world multi-object datasets.
Findings
Achieves state-of-the-art results on real-world and synthetic datasets.
Supports flexible conditioning on actions, language, and goals.
Demonstrates effectiveness in goal-conditioned imitation learning.
Abstract
We introduce Latent Particle World Model (LPWM), a self-supervised object-centric world model scaled to real-world multi-object datasets and applicable in decision-making. LPWM autonomously discovers keypoints, bounding boxes, and object masks directly from video data, enabling it to learn rich scene decompositions without supervision. Our architecture is trained end-to-end purely from videos and supports flexible conditioning on actions, language, and image goals. LPWM models stochastic particle dynamics via a novel latent action module and achieves state-of-the-art results on diverse real-world and synthetic datasets. Beyond stochastic video modeling, LPWM is readily applicable to decision-making, including goal-conditioned imitation learning, as we demonstrate in the paper. Code, data, pre-trained models and video rollouts are available: https://taldatech.github.io/lpwm-web
Peer Reviews
Decision·ICLR 2026 Oral
1. The idea of incorporating object-centric methods into world models is well-motivated. The proposed method is flexible and supports diverse conditioning (actions, language, goals, multi-view), which is practical. 2. The authors provide extensive experimental results to show the effectiveness of the proposed LPWM. Results on real-world datasets (e.g., BAIR, Bridge) highlight its robustness beyond simulated environments. 3. LPWM achieves SOTA on stochastic video prediction (e.g., FVD of 85.45
1. The contributions are somewhat incremental, as the authors extend an existing video prediction method (DDLP) by introducing a context module for additional conditioning. However, this should not warrant rejection, given the extensive experiments demonstrating the effectiveness of these improvements. 2. The experiments could be strengthened. While the evaluations focus primarily on video prediction, the world model is intended for policy training. Comparisons with other world models (e.g., th
* Introducing a per-particle action latent, rather than a single global action latent (as done in past works), is very sensible, especially for complex datasets, as it introduces an additional degree of freedom and increases the representational power of the model. * The idea of the context module that combines external conditioning with implicit actions is sound and effective. This provides a universal way to implement action conditioning. * I find the interplay between the inverse dyn
* The novelty is somewhat limited - the authors extend the previous work by introducing a per-particle action latent. Nevertheless, it has shown improved results. * I feel like the choice to keep all M particles (and not to perform filtering to avoid tracking) sacrifices the ability to separate real objects from “empty” slots, and thus sacrifices interpretability and explicit object modelling - which in my opinion is a nice property of the original particle models.
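The distinction this review draws between a single global action latent and a per-particle action latent can be illustrated with array shapes alone. The following is a minimal sketch, not the paper's architecture: the function names, the dimensions `M`, `D`, `A`, and the use of plain concatenation for conditioning are all assumptions made for illustration.

```python
import numpy as np

def condition_global(particles, global_action):
    """Attach ONE shared action latent to every particle (hypothetical helper).

    particles:     (M, D) array, one row per particle
    global_action: (A,) array, shared by all particles
    """
    num_particles = particles.shape[0]
    # Every particle sees the same A-dimensional latent.
    tiled = np.broadcast_to(global_action, (num_particles, global_action.shape[0]))
    return np.concatenate([particles, tiled], axis=-1)

def condition_per_particle(particles, particle_actions):
    """Attach a SEPARATE action latent to each particle (hypothetical helper).

    particles:        (M, D) array
    particle_actions: (M, A) array, one latent per particle
    """
    return np.concatenate([particles, particle_actions], axis=-1)

# Illustrative sizes only; these are not values from the paper.
rng = np.random.default_rng(0)
M, D, A = 4, 6, 2
particles = rng.normal(size=(M, D))

global_out = condition_global(particles, rng.normal(size=A))
per_out = condition_per_particle(particles, rng.normal(size=(M, A)))
```

Both outputs have shape `(M, D + A)`, but in the global case the last `A` columns are identical across rows, while in the per-particle case each row carries its own latent. That extra freedom is what the reviewer credits for the added representational power.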
The authors thoroughly motivate their work and propose a reasonable addition to particle-based generative dynamics models. The authors thoroughly ablate the base model changes as well as the proposed per-latent particle, which makes the evaluation of the method admirably robust. I have some minor questions on some of the scores, which I believe are due to my unfamiliarity with the datasets. While the writing could be improved (see below) the overall method is straightforward and clear, and the
The gains on the vision datasets seem very marginal, and not significant at a reasonable confidence interval. The provided confidence intervals seem to overlap, which makes the results very hard to judge. No information is provided on how confidence intervals (the +/- numbers in the tables) were computed. This makes the previous issue more problematic to assess. The robotics experiments do not seem to compare planning with other latent action methods with the proposed method. That limits the clarit
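The reviewer's concern about overlapping intervals can be made concrete with a small sketch. The source does not say how the +/- values were computed, so this assumes one common convention: a normal-approximation 95% interval over independent runs (mean ± 1.96 × standard error). The score lists are made-up placeholders, not results from the paper, and the helper names are hypothetical.

```python
import math
import statistics

def mean_ci(values, z=1.96):
    """Return (mean, half_width) of a normal-approximation confidence interval.

    Assumes the values are independent runs (e.g. seeds); z=1.96 gives ~95%.
    """
    mean = statistics.mean(values)
    # Standard error of the mean, using the sample standard deviation.
    half_width = z * statistics.stdev(values) / math.sqrt(len(values))
    return mean, half_width

def intervals_overlap(a, b):
    """True if two (mean, half_width) intervals share at least one point."""
    (mean_a, half_a), (mean_b, half_b) = a, b
    return abs(mean_a - mean_b) <= half_a + half_b

# Placeholder scores for two hypothetical methods over five runs each.
scores_a = [10.0, 12.0, 11.0, 13.0, 9.0]
scores_b = [10.5, 12.5, 11.5, 13.5, 9.5]

ci_a = mean_ci(scores_a)
ci_b = mean_ci(scores_b)
overlap = intervals_overlap(ci_a, ci_b)
```

Overlap is only a rough heuristic: overlapping intervals do not by themselves prove that a difference is insignificant, so a paired test on per-run differences would be a more direct check when runs are matched.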
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics
