What Drives Success in Physical Planning with Joint-Embedding Predictive World Models?
Basile Terver, Tsung-Yen Yang, Jean Ponce, Adrien Bardes, Yann LeCun

TL;DR
This paper studies which design choices make joint-embedding predictive world models (JEPA-WMs) succeed at physical planning in navigation and manipulation tasks, and proposes a model that outperforms the DINO-WM and V-JEPA-2-AC baselines.
Contribution
It provides a comprehensive analysis of key components in JEPA-WMs, identifying optimal configurations and demonstrating improved performance over existing baselines.
Findings
Model architecture, training objective, and planning algorithm significantly influence planning success.
The proposed model outperforms DINO-WM and V-JEPA-2-AC in navigation and manipulation tasks.
Code, data, and checkpoints are publicly available for reproducibility.
Abstract
A long-standing challenge in AI is to develop agents capable of solving a wide range of physical tasks and generalizing to new, unseen tasks and environments. A popular recent approach involves training a world model from state-action trajectories and subsequently using it with a planning algorithm to solve new tasks. Planning is commonly performed in the input space, but a recent family of methods has introduced planning algorithms that optimize in the learned representation space of the world model, with the promise that abstracting irrelevant details yields more efficient planning. In this work, we characterize models from this family as JEPA-WMs and investigate the technical choices that make algorithms from this class work. We propose a comprehensive study of several key components with the objective of finding the optimal approach within the family. We conducted experiments using…
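To make "planning in the learned representation space" concrete, here is a minimal sketch under stated assumptions. The `encode` and `predict` functions are hypothetical toy stand-ins (an identity encoder and additive dynamics), not the paper's networks. The cross-entropy method (CEM) is used only as one common sampling-based optimizer, and the cost is an L1 or L2 distance between the predicted final embedding and the goal embedding.

```python
import numpy as np

def encode(obs):
    # Hypothetical stand-in for a frozen visual encoder: identity map.
    return np.asarray(obs, dtype=float)

def predict(z, a):
    # Hypothetical stand-in for the learned predictor: additive dynamics.
    return z + a

def rollout(z0, actions):
    # Unroll the predictor in embedding space over a sequence of actions.
    z = z0
    for a in actions:
        z = predict(z, a)
    return z

def latent_cost(z, z_goal, norm="l2"):
    # Distance between predicted and goal embeddings (L1 or L2).
    diff = z - z_goal
    if norm == "l1":
        return float(np.abs(diff).sum())
    return float(np.sqrt((diff ** 2).sum()))

def cem_plan(z0, z_goal, horizon=5, action_dim=2, n_samples=256,
             n_elite=32, n_iters=10, norm="l2", seed=0):
    # Cross-entropy method: sample action sequences, keep the lowest-cost
    # ones, and refit a diagonal Gaussian to them.
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        noise = rng.standard_normal((n_samples, horizon, action_dim))
        samples = np.clip(mean + std * noise, -1.0, 1.0)
        costs = np.array([latent_cost(rollout(z0, s), z_goal, norm)
                          for s in samples])
        elite = samples[np.argsort(costs)[:n_elite]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6
    return mean

# Toy usage: the "observations" are already 2-D states.
z0 = encode([0.0, 0.0])
z_goal = encode([2.0, -1.5])
initial_cost = latent_cost(z0, z_goal)
plan = cem_plan(z0, z_goal)
final_cost = latent_cost(rollout(z0, plan), z_goal)
print(initial_cost, final_cost)
```

In a real system the encoder and predictor would be trained networks and the first planned action would typically be executed before replanning; those details are outside this sketch.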
Peer Reviews
Decision·Submitted to ICLR 2026
The paper aims to cover many factors related to the success of joint-embedding predictive world models. Based on these experiments, it proposes a method that combines the best-performing elements. This method performs better than others on the majority of tasks. I am not familiar enough with these environments to evaluate the significance of this.
The paper seems to have too wide a scope. Many questions arise from the results that could probably be papers on their own. However, because of the wide scope, there is no room to delve into these questions to any meaningful degree. Here are my main points to illustrate this:
- The planning architectures are compared only using L1 and L2 distances from the goal image. These distances can be very uninformative in practice. For example, there is convincing work showing
1. The paper is well written and clear in presentation.
2. The paper performs a systematic and granular exploration of multiple design factors in JEPA-based world models, isolating the effects of planners, context size, encoder type, etc. This level of experimental control is rare in world model research and provides actionable guidance for future work.
3. Evaluations span simulated control (MetaWorld, Maze, Push-T, Wall) and real-robot datasets (DROID, Robocasa). The authors' conclusions general
Weaknesses + Questions here:
1. The paper provides a comprehensive empirical evaluation. Its contribution lies mainly in mapping hyperparameter effects within existing JEPA frameworks, making it more of a practitioner’s guide than a scientific leap. While this is not a dealbreaker for me, I think it hurts novelty a bit.
2. Despite the cost function being fully differentiable, the study does not include a gradient-based or hybrid planner comparison. In Fig. 3 (planning optimizers), the objective
- Broad and careful empirical study of JEPA-based world models for planning across both simulated and real-robot settings. The scope covers navigation and manipulation with diverse datasets and evaluation regimes.
- Clear and useful findings that practitioners can act on. DINO encoders provide stronger fine-grained spatial cues than V-JEPA encoders for planning. Two-step rollout loss and modest context help, with the constraint that the planning context should not exceed the training context. Ad
- Main novelty is empirical rather than algorithmic. The proposed model is a composition of known parts tuned within the JEPA-WM family, which may limit the perceived conceptual advance.
- Real-world evaluation is limited to offline action matching on 16 Franka videos and qualitative rollouts. No closed-loop robot trials are reported, which limits conclusions about deployment readiness.
- Not all improvements are consistent. The model underperforms V-JEPA-2-AC on Robocasa Reach. A deeper analysi
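One review above notes that a two-step rollout loss helps. As a rough illustration of what such an objective can look like, the sketch below feeds the predictor its own output for `n_steps` steps and averages the squared error against the encoder's target embeddings at each step. The linear `predict` function, the weight matrix `W`, and the array shapes are hypothetical choices for illustration, not the paper's architecture or exact loss.

```python
import numpy as np

def predict(z, a, W):
    # Hypothetical linear predictor: next embedding from embedding and action.
    return z @ W + a

def multistep_rollout_loss(z_seq, a_seq, W, n_steps=2):
    """Mean squared error over n_steps-long autoregressive rollouts.

    z_seq: (T + 1, D) target embeddings from the encoder.
    a_seq: (T, D) actions, assumed here to live in the embedding dimension.
    """
    total, count = 0.0, 0
    for t in range(len(a_seq) - n_steps + 1):
        z_hat = z_seq[t]                      # start from a true embedding
        for k in range(n_steps):
            # After the first step, the predictor consumes its own output.
            z_hat = predict(z_hat, a_seq[t + k], W)
            total += float(np.mean((z_hat - z_seq[t + k + 1]) ** 2))
            count += 1
    return total / count

# Toy data generated by the "true" dynamics W_true = identity.
rng = np.random.default_rng(0)
W_true = np.eye(3)
actions = rng.standard_normal((6, 3))
embeddings = [np.zeros(3)]
for a in actions:
    embeddings.append(predict(embeddings[-1], a, W_true))
embeddings = np.stack(embeddings)

loss_true = multistep_rollout_loss(embeddings, actions, W_true, n_steps=2)
loss_wrong = multistep_rollout_loss(embeddings, actions, 0.5 * np.eye(3), n_steps=2)
print(loss_true, loss_wrong)
```

With `n_steps=1` this reduces to ordinary one-step (teacher-forced) prediction; larger values penalize errors that compound when the model is unrolled, which is the situation a planner faces.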
Taxonomy
Topics: AI-based Problem Solving and Planning · Robotic Path Planning Algorithms · Reinforcement Learning in Robotics
