World-in-World: World Models in a Closed-Loop World

Jiahan Zhang; Muqing Jiang; Nanru Dai; Taiming Lu; Arda Uzunoglu; Shunchi Zhang; Yana Wei; Jiahao Wang; Vishal M. Patel; Paul Pu Liang; Daniel Khashabi; Cheng Peng; Rama Chellappa; Tianmin Shu; Alan Yuille; Yilun Du; Jieneng Chen

arXiv:2510.18135·cs.CV·October 22, 2025

TL;DR

This paper introduces World-in-World, a platform for evaluating generative world models in closed-loop environments, revealing that controllability and compute allocation are crucial for task success beyond visual quality.

Contribution

It presents the first benchmark for closed-loop evaluation of world models, along with a data scaling law and insights on factors influencing embodied task performance.

Findings

01

Controllability outweighs visual quality for task success.

02

Scaling with action-observation data improves performance more than model upgrades.

03

More inference compute enhances closed-loop performance.

Abstract

Generative world models (WMs) can now simulate worlds with striking visual realism, which naturally raises the question of whether they can endow embodied agents with predictive perception for decision making. Progress on this question has been limited by fragmented evaluation: most existing benchmarks adopt open-loop protocols that emphasize visual quality in isolation, leaving the core issue of embodied utility unresolved, i.e., do WMs actually help agents succeed at embodied tasks? To address this gap, we introduce World-in-World, the first open platform that benchmarks WMs in a closed-loop world that mirrors real agent-environment interactions. World-in-World provides a unified online planning strategy and a standardized action API, enabling heterogeneous WMs for decision making. We curate four closed-loop environments that rigorously evaluate diverse WMs, prioritize task success as…
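The closed-loop protocol the abstract describes (an agent imagines outcomes with a world model, acts in the environment, observes, and replans) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the 1-D environment, the names `env_step`, `world_model_rollout`, `score_rollout`, `plan`, and `run_closed_loop`, the exhaustive candidate enumeration, and the distance-based score are hypothetical stand-ins, not the paper's planner, action API, or reward function.

```python
from itertools import product
from typing import List, Sequence, Tuple

# Hypothetical discrete action set: move left, stay, move right.
ACTIONS: Tuple[int, ...] = (-1, 0, 1)


def env_step(state: int, action: int) -> int:
    """Toy 'real' environment: a position on a 1-D line."""
    return state + action


def world_model_rollout(state: int, actions: Sequence[int]) -> List[int]:
    """Stand-in for a generative world model: imagine the states an action sequence leads to."""
    imagined = []
    for action in actions:
        state = state + action
        imagined.append(state)
    return imagined


def score_rollout(imagined: Sequence[int], goal: int) -> float:
    """Assumed scoring rule: negative summed distance to the goal over the imagined rollout."""
    return -float(sum(abs(s - goal) for s in imagined))


def plan(state: int, goal: int, horizon: int = 3) -> Tuple[int, ...]:
    """Enumerate candidate action sequences, imagine each one, and keep the best-scoring plan."""
    candidates = product(ACTIONS, repeat=horizon)
    return max(
        candidates,
        key=lambda acts: score_rollout(world_model_rollout(state, acts), goal),
    )


def run_closed_loop(start: int, goal: int, max_steps: int = 20) -> List[int]:
    """Closed loop: plan, execute only the first action in the environment, observe, replan."""
    state = start
    visited = [state]
    for _ in range(max_steps):
        if state == goal:
            break
        best_plan = plan(state, goal)
        state = env_step(state, best_plan[0])
        visited.append(state)
    return visited
```

In this toy setting `run_closed_loop(0, 5)` returns `[0, 1, 2, 3, 4, 5]`. The point of the sketch is the structure rather than the numbers: the measured outcome is whether the agent reaches the goal in the environment, not how realistic the imagined rollout looks, and a larger `horizon` spends more inference compute per decision.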

Peer Reviews

Decision · ICLR 2026 Oral

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The paper's main contribution is shifting the evaluation of world models from open-loop visual fidelity to closed-loop embodied task success. This is a significant and necessary service for the field.
2. The benchmark, through its Unified Action API, allows for direct, fair comparison of heterogeneous SOTA video generators across comprehensive embodied tasks.
3. The paper's "three surprises" are all well-supported by evidence and can provide insights for the community.

Weaknesses

1. The "three surprises" are to some extent overstated, especially the first and second, as these findings have been pointed out by many previous papers, e.g. [1,2,3].
2. The paper does not directly specify the reward function used to score the world model's imagined trajectories.
3. How the policy is refined during planning is not clear (beyond directly using the plan with the highest score).
4. There already exist some benchmarks that also evaluate the pretrained v

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- The authors point out the “open-loop bias” in current world model evaluation, i.e., an overemphasis on generation quality and neglect of its practical value in closed-loop decision-making. Wow! For the first time, the focus of assessment has shifted from “Looking like” to “Working with,” a paradigm shift that plays an important guiding role for embodied AI and the world model community.
- This paper constructs four tasks covering perception, navigation, and operation, and evaluates more t

Weaknesses

- In the current framework, both the proposal and revision policies use a powerful VLM (such as Qwen2.5-VL-72B). This makes it unclear how much of the task success is attributable to the world model vs. the strategy itself. Complementary ablation experiments are suggested: for example, fixing the strategy's capability (such as using weak or regular strategies) and observing differences in performance gains across different world models, to assess the decision-aid value of the WM more cleanly.
- Habitat-sim

Reviewer 03 · Rating 8 · Confidence 4

Strengths

1) Addresses an existing gap in the research community on assessing world models for their ability to be integrated into decision-making embodied agents, presenting a well-rounded general framework for closed-loop evaluation. This allows the research community to make informed decisions when pursuing research goals around improving the utility of world models for embodied scenarios.
2) Presents a comprehensive set of tasks for embodied agents - 4 tasks (active recognition, image-goal navigation, a

Weaknesses

Overall, the work provides a great set of references and supporting experiments. The paper is easy to follow, with detailed result figures and tables. A couple of minor suggestions:
- I might have missed this, but I couldn't see the results of the A-EQA task evaluation from Table 2 discussed in more detail in Section 3. There is a stronger focus on the other 3 tasks, specifically AR, ImageNav and robotic manipulation.
- In Fig 5(b), Wan2.2 5B (post-trained) seems to have high controllability, b

Code & Models

Datasets

zonszer/WIW_datasets
dataset · 30 dl

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Social Robot Interaction and HRI