Learning World Models for Interactive Video Generation

Taiye Chen; Xun Hu; Zihan Ding; Chi Jin

arXiv:2505.21996 · cs.CV · April 14, 2026

TL;DR

This paper tackles two obstacles to long video generation with world models, compounding errors and insufficient memory, using retrieval-augmented generation to improve spatiotemporal coherence and reduce long-term error.

Contribution

The paper introduces video retrieval augmented generation (VRAG), a retrieval-augmented approach with explicit global state conditioning that significantly improves long-term spatiotemporal coherence in video generation.

Findings

01  VRAG reduces long-term compounding errors in video generation.

02  Explicit global state conditioning enhances spatiotemporal consistency.

03  Naive autoregressive methods are less effective for long-term video coherence.

Abstract

Foundational world models must both be interactive and preserve spatiotemporal coherence to support effective future planning with action choices. However, present models for long video generation have limited inherent world modeling capabilities due to two main challenges: compounding errors and insufficient memory mechanisms. We enhance image-to-video models with interactive capabilities through additional action conditioning and an autoregressive framework, and reveal that compounding error is inherently irreducible in autoregressive video generation, while an insufficient memory mechanism leads to incoherence of world models. We propose video retrieval augmented generation (VRAG) with explicit global state conditioning, which significantly reduces long-term compounding errors and increases spatiotemporal consistency of world models. In contrast, naive autoregressive generation with extended…
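The retrieval idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no architectural details, so everything here is an assumption made for illustration, including the names `FrameMemory`, `step_state`, `rollout`, and `toy_generator`, the choice of a 2D position as the "global state", and the Euclidean nearest-neighbour retrieval rule. A real system would replace `toy_generator` with an action-conditioned video model.

```python
import math
from dataclasses import dataclass, field


@dataclass
class FrameMemory:
    """Stores (global_state, frame) pairs from earlier generation steps."""
    states: list = field(default_factory=list)
    frames: list = field(default_factory=list)

    def add(self, state, frame):
        self.states.append(tuple(state))
        self.frames.append(frame)

    def retrieve(self, query_state, k=2):
        """Return up to k stored frames whose states lie closest to query_state."""
        order = sorted(
            range(len(self.states)),
            key=lambda i: math.dist(self.states[i], query_state),
        )
        return [self.frames[i] for i in order[:k]]


def step_state(state, action):
    # Assumption: the global state is a 2D position and an action is a displacement.
    return (state[0] + action[0], state[1] + action[1])


def toy_generator(state, action, retrieved):
    # Placeholder for a video model: records which past states were used as context.
    return {"state": state, "context": [f["state"] for f in retrieved]}


def rollout(actions, generate_frame, start_state=(0.0, 0.0), k=2):
    """Autoregressive loop: update state, retrieve similar past frames, generate, store."""
    memory = FrameMemory()
    state = start_state
    frames = []
    for action in actions:
        state = step_state(state, action)
        retrieved = memory.retrieve(state, k)   # context chosen by state similarity
        frame = generate_frame(state, action, retrieved)
        memory.add(state, frame)
        frames.append(frame)
    return frames, memory
```

Calling `rollout([(1, 0), (1, 0), (-1, 0)], toy_generator, k=1)` moves the toy agent right twice and then back. On the third step the retrieved context is the frame generated at the first step, since it shares the same position. This is the behaviour a memory mechanism needs when the agent revisits a place that has left the recent context window.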

Videos

Learning World Models for Interactive Video Generation · SlidesLive