Grounded World Model for Semantically Generalizable Planning
Quanyi Li, Lan Feng, Haonan Zhang, Wuyang Li, Letian Wang, Alexandre Alahi, Harold Soh

TL;DR
This paper introduces a Grounded World Model that uses vision-language-aligned embeddings to improve semantic generalization in planning tasks, enabling better zero-shot performance in unseen environments.
Contribution
The paper proposes learning a world model in a vision-language-aligned latent space, which improves semantic generalization in visuomotor planning compared with traditional vision-language-action models (VLAs).
Findings
GWM-MPC achieves 87% success on unseen tasks in the WISER benchmark.
Traditional VLAs achieve only 22% success, which suggests overfitting to the training tasks.
GWM-MPC outperforms existing methods in zero-shot generalization.
Abstract
In Model Predictive Control (MPC), world models predict the future outcomes of various action proposals, which are then scored to guide the selection of the optimal action. For visuomotor MPC, the score function is a distance metric between a predicted image and a goal image, measured in the latent space of a pretrained vision encoder such as DINO or JEPA. However, it is challenging to obtain the goal image before task execution, particularly in new environments. Additionally, conveying the goal through an image offers limited interactivity compared with natural language. In this work, we propose to learn a Grounded World Model (GWM) in a vision-language-aligned latent space. As a result, each proposed action is scored based on how close its future outcome is to the task instruction, reflected by the similarity of embeddings. This approach transforms the visuomotor MPC to a VLA…
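The scoring idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the world model below is a hypothetical toy that simply adds the action to the latent state, the proposals come from plain random sampling, and the names `toy_world_model`, `rollout`, and `plan_first_action` are assumptions made for illustration. It only shows the general shape of the loop, where each action proposal is rolled out in latent space and scored by its cosine similarity to an instruction embedding.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_world_model(latent, action):
    """Hypothetical stand-in for a learned world model.

    A real model would predict the next latent state from data; this toy
    version just adds the action vector to the current latent.
    """
    return [z + a for z, a in zip(latent, action)]


def rollout(latent, actions):
    """Apply a sequence of actions and return the predicted final latent."""
    for action in actions:
        latent = toy_world_model(latent, action)
    return latent


def plan_first_action(latent, instruction_embedding,
                      num_proposals=64, horizon=3, seed=0):
    """Random-shooting MPC: sample action sequences, score each predicted
    outcome against the instruction embedding, and return the first action
    of the best sequence together with its score."""
    rng = random.Random(seed)
    dim = len(latent)
    best_score = -math.inf
    best_sequence = None
    for _ in range(num_proposals):
        sequence = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                    for _ in range(horizon)]
        predicted = rollout(latent, sequence)
        score = cosine_similarity(predicted, instruction_embedding)
        if score > best_score:
            best_score = score
            best_sequence = sequence
    return best_sequence[0], best_score


if __name__ == "__main__":
    # Made-up 2-D vectors standing in for an image latent and a text embedding.
    action, score = plan_first_action([0.0, 0.0], [1.0, 0.0])
    print("first action:", action, "score:", score)
```

In an MPC loop, only the returned first action would be executed before replanning from the new observation. Replacing the instruction embedding with a goal-image embedding would recover the image-goal scoring that the abstract describes as the conventional setup.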
