Vid2World: Crafting Video Diffusion Models to Interactive World Models

Siqiao Huang; Jialong Wu; Qixing Zhou; Shangchen Miao; Mingsheng Long

arXiv:2505.14357·cs.CV·March 10, 2026

Open Access·3 Reviews

TL;DR

Vid2World introduces a novel method to transform pre-trained video diffusion models into interactive world models, enhancing their controllability and applicability in complex decision-making environments.

Contribution

The paper proposes a systematic approach to adapt video diffusion models into interactive world models with causalization and action guidance mechanisms.
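
To illustrate what "causalization" can mean for a temporal layer, here is a minimal sketch that is not taken from the paper. It contrasts a symmetric temporal convolution, where each output frame also sees future frames, with a left-padded one that sees only the current and past frames. The function name `temporal_conv`, the 1D toy signal, and the kernel values are assumptions made for illustration. How the pre-trained kernel weights are carried over is not described in this summary and is not modeled here.

```python
import numpy as np

def temporal_conv(x, kernel, causal):
    """Slide `kernel` (odd length K) over the time axis of `x` (shape [T]).

    causal=False: symmetric zero padding, so output t sees frames t-K//2 .. t+K//2.
    causal=True:  left-only zero padding, so output t sees frames t-K+1 .. t.
    """
    k = len(kernel)
    pad = (k - 1, 0) if causal else (k // 2, k // 2)
    padded = np.pad(x, pad)
    # Output length equals input length in both modes.
    return np.array([padded[t:t + k] @ kernel for t in range(len(x))])

# Toy check (hypothetical data): perturb a "future" frame and compare earlier outputs.
frames = np.arange(6.0)
kernel = np.array([0.25, 0.5, 0.25])
perturbed = frames.copy()
perturbed[4] += 10.0  # change frame 4 only

causal_unchanged = np.allclose(
    temporal_conv(frames, kernel, causal=True)[:4],
    temporal_conv(perturbed, kernel, causal=True)[:4],
)
noncausal_unchanged = np.allclose(
    temporal_conv(frames, kernel, causal=False)[:4],
    temporal_conv(perturbed, kernel, causal=False)[:4],
)
```

In the causal mode, outputs for frames 0 to 3 do not depend on frame 4, which is the property an autoregressive model needs. In the symmetric mode, the output at frame 3 changes.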

Findings

01

Effective transfer of video diffusion models to interactive environments

02

Improved action controllability in world models

03

Successful application across robotics, gaming, and navigation domains
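
Finding 02 concerns action controllability. This summary does not specify how the action guidance is computed, so the sketch below assumes a classifier-free-guidance-style combination, a common technique in diffusion models, purely as an illustration. The function name `guided_prediction` and the toy arrays are hypothetical.

```python
import numpy as np

def guided_prediction(cond_pred, uncond_pred, scale):
    """Combine an action-conditioned and an action-free model prediction.

    scale = 0 ignores the action, scale = 1 returns the conditional prediction,
    and scale > 1 extrapolates further in the direction the action suggests.
    """
    return uncond_pred + scale * (cond_pred - uncond_pred)

# Toy stand-ins for the two model outputs (assumed values, not real data).
uncond = np.array([0.0, 1.0])
cond = np.array([1.0, 1.0])
stronger = guided_prediction(cond, uncond, scale=2.0)
```

With `scale=2.0`, the component on which the two predictions disagree is pushed twice as far from the action-free prediction, while the component on which they agree is unchanged.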

Abstract

World models, which predict future transitions from past observation and action sequences, have shown great promise for improving data efficiency in sequential decision-making. However, existing world models often require extensive domain-specific training and still produce low-fidelity, coarse predictions, limiting their usefulness in complex environments. In contrast, video diffusion models trained on large-scale internet data have demonstrated impressive capabilities in generating high-quality videos that capture diverse real-world dynamics. In this work, we present Vid2World, a general approach for leveraging and transferring pre-trained video diffusion models into interactive world models. To bridge the gap, Vid2World systematically explores video diffusion causalization, reshaping both the architecture and training objective of pre-trained models to enable autoregressive…
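
The interface the abstract describes, predicting future transitions from past observations and actions, can be sketched as an autoregressive rollout loop. This is a minimal sketch under stated assumptions: `rollout` and `toy_predict_next` are hypothetical names, and the toy predictor is a stand-in for a learned model. In a diffusion-based world model, the prediction step would itself run a denoising procedure, which is not shown.

```python
from typing import Callable, List, Sequence
import numpy as np

Predictor = Callable[[List[np.ndarray], List[np.ndarray]], np.ndarray]

def rollout(predict_next: Predictor,
            initial_obs: Sequence[np.ndarray],
            actions: Sequence[np.ndarray]) -> List[np.ndarray]:
    """Generate one new observation per action, feeding predictions back as context."""
    observations = list(initial_obs)
    taken: List[np.ndarray] = []
    for action in actions:
        taken.append(action)
        # The model only sees the past: all observations so far and all actions so far.
        observations.append(predict_next(observations, taken))
    return observations

def toy_predict_next(observations, actions):
    # Hypothetical dynamics for illustration: shift the last frame by the last action.
    return observations[-1] + actions[-1]

trajectory = rollout(
    toy_predict_next,
    initial_obs=[np.zeros(2)],
    actions=[np.ones(2), 2 * np.ones(2)],
)
```

Because each predicted frame becomes part of the context for the next step, any prediction error is also fed back into the model.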

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 3

Strengths

- The paper proposes an interesting way to convert a non-causal video diffusion model into an autoregressive, interactive world model by manipulating its layers and then fine-tuning the model.
- By transferring priors from a pre-trained video diffusion model, Vid2World produces high-fidelity predictions that significantly outperform other world models on metrics like FVD and FID.
- The approach is demonstrated across multiple, diverse domains.

Weaknesses

- The base model is pre-trained on passive, action-free videos, which gives it a strong prior that may not match the goal of world models. The paper should have discussed this potential mismatch and its effect.
- The converted autoregressive model may suffer from error accumulation. Converting a diffusion model into such an AR model may make this problem even more pronounced.

Reviewer 02·Rating 4·Confidence 5

Strengths

S1) The proposed method is simple and effective. The writing is easy to follow.
S2) I like the scope of the paper, which tries to establish general world models for various domains by adapting foundation video models.

Weaknesses

W1) I believe the core contribution of this paper is how to causalize temporal convolution layers. Unfortunately, the most recent video models (e.g., NVIDIA Cosmos and WAN, also mentioned by the authors) typically use pure DiT architectures without temporal convolutions. Therefore, the main story of this paper is not applicable to those models, which somewhat diminishes the significance of the contribution.
W2) Frame-level action conditioning is not new, and the paper misrepresents its…

Reviewer 03·Rating 8·Confidence 3

Strengths

The problem is relevant and interesting. The paper is linguistically well written (in places, it sounds like LLM lingo). The authors tackle the problem by applying a clever combination of methods, mostly from the literature. The execution seems competent. The empirical results look strong. I enjoyed reading the paper.

Weaknesses

• The method description does not provide clear intuition for why the action conditioning works; the paper could better convey how the model functions (beyond the mechanics).
• The use of the word “causal” is ambiguous: sometimes it is merely time-directional, while elsewhere it invokes counterfactuals/interventions in a broader sense. This limits the potential audience of the paper.

Taxonomy

Topics: Geographic Information Systems Studies

Methods: Diffusion