Semantic World Models
Jacob Berg, Chuning Zhu, Yanda Bao, Ishan Durugkar, Abhishek Gupta

TL;DR
This paper introduces a semantic world modeling approach for robotics that predicts task-relevant information using vision-language models, improving planning and generalization over traditional pixel-based methods.
Contribution
It proposes framing world modeling as a visual question answering task, enabling the use of pretrained vision-language models for better robotic planning.
Findings
Semantic world models outperform pixel-based models in generalization.
Training with image-action-text data enhances decision-making.
Vision-language models improve robustness in robotic control.
Abstract
Planning with world models offers a powerful paradigm for robotic control. Conventional approaches train a model to predict future frames conditioned on current frames and actions, which can then be used for planning. However, the objective of predicting future pixels is often at odds with the actual planning objective; strong pixel reconstruction does not always correlate with good planning decisions. This paper posits that instead of reconstructing future frames as pixels, world models only need to predict task-relevant semantic information about the future. For such prediction the paper poses world modeling as a visual question answering problem about semantic information in future frames. This perspective allows world modeling to be approached with the same tools underlying vision language models. Thus vision language models can be trained as "semantic" world models through a…
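As a rough illustration of the framing in the abstract, the sketch below scores candidate action sequences by how likely a model is to give the desired answers to questions about the future, then keeps the best candidate. It is a minimal sketch under stated assumptions, not the paper's implementation: `toy_answer_prob` is a made-up stand-in for a trained semantic world model, and the names `plan_value`, `pick_best`, and the example question and weights are all hypothetical.

```python
import math
from typing import Callable, Sequence

# Hypothetical task specification: (question, desired_answer, weight) triples.
TaskSpec = Sequence[tuple[str, str, float]]

# Hypothetical model interface (an assumption, not the paper's API):
# returns P(answer | observation, actions, question) for a future frame.
AnswerProb = Callable[[str, Sequence[float], str, str], float]


def plan_value(answer_prob: AnswerProb, observation: str,
               actions: Sequence[float], task: TaskSpec) -> float:
    """Weighted sum of log-likelihoods of the desired future answers."""
    total = 0.0
    for question, answer, weight in task:
        p = answer_prob(observation, actions, question, answer)
        total += weight * math.log(max(p, 1e-9))  # clamp to avoid log(0)
    return total


def pick_best(answer_prob: AnswerProb, observation: str,
              candidates: Sequence[Sequence[float]], task: TaskSpec):
    """Sampling-style planning: return the highest-value candidate."""
    return max(candidates,
               key=lambda acts: plan_value(answer_prob, observation, acts, task))


def toy_answer_prob(observation: str, actions: Sequence[float],
                    question: str, answer: str) -> float:
    """Toy stand-in for a semantic world model (purely illustrative).

    Pretends the answer is more likely "yes" the further the 1-D actions
    move in total; a real model would condition on images and text.
    """
    p_yes = 1.0 / (1.0 + math.exp(-10.0 * (sum(actions) - 0.5)))
    return p_yes if answer == "yes" else 1.0 - p_yes


task = [("Is the gripper touching the block?", "yes", 1.0)]
candidates = [[0.0, 0.1], [0.4, 0.5], [0.2, 0.1]]
best = pick_best(toy_answer_prob, "current frame", candidates, task)
print(best)
```

The point of the sketch is only that the planning objective is expressed through answers to questions about the future, so no future pixels are ever reconstructed.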
Peer Reviews
Decision · Submitted to ICLR 2026
- Novel formulation: The paper introduces a novel conceptual framing of world models as future-tense semantic predictors rather than pixel predictors. This reframing is non-trivial and departs meaningfully from both pixel/latent-based world modeling and language-conditioned policy architectures (e.g., VLAs), offering a new axis for research in decision-making with foundation models.
- Planning experiments: Compatible with both sampling and gradient refinement; the latter gives strong policy-imp
- SAQA depends on privileged state: The central data engine uses oracle simulation state to label future Q/A tuples; there’s no empirical demonstration of a real-data pipeline (weak/self-supervised QA labels, multi-view, proprio/contact cues, etc.). This is a potential barrier for training this method on real-world data and further deployment and, IMO, the main limitation.
- From-scratch planning is not yet practical: Sampling with a large VLM is slow; the most effective mode is gradient refine
1. New framework: The core idea of shifting from pixel-level prediction to semantic, question-based prediction is novel and compelling. It directly addresses a known weakness of many video-based world models.
2. Empirical results: The SWM achieves impressive results (Figure 5, Table 8), significantly outperforming the base policy and other baselines (IDQL, AVD).
3. Effective use of suboptimal data: The paper shows (Table 2) that model performance improves when trained on a combination of expert
1. The paper posits that a VLM could be used to decompose a high-level goal into these QA pairs (Section 2), but this is not demonstrated. The method requires a human to meticulously design a "curriculum" of questions to define a task, which is not scalable.
2. Comparisons: The comparison to the "Action Conditioned Video Diffusion" (AVD) baseline is not a fair "apples-to-apples" comparison. The AVD model is used to predict a future frame, and then the authors' own SWM model is used to perform V
- Originality: “future QA” as control objective. Defining task value via SWM answer likelihoods aligns the model objective with decision-making and avoids pixel prediction; the action-token projection turns a pretrained VLM into an action-conditioned model with minimal surgery.
- Quality: clear task specification and planners. Tasks are explicitly defined via question/answer/weight sets (Table 7), and the value function includes a sub-chunking mechanism to encourage earlier completion (Eqs.
- Privileged-state supervision limits real-world portability. SAQA labels depend on oracle future state; acquiring comparable labels on hardware is challenging. A path to weaker supervision (pseudo-labels from frozen VLMs, success detectors) would strengthen applicability.
- Baseline coverage for latent world models is narrow. Comparisons focus on AVD (pixel-level) and IDQL; omitting modern latent world-model planners (e.g., Dreamer/TD-MPC-style) makes it hard to isolate the benefit of semanti
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotic Path Planning Algorithms · AI-based Problem Solving and Planning
