Yume: An Interactive World Generation Model

Xiaofeng Mao; Shaoheng Lin; Zhen Li; Chuanhao Li; Wenshuo Peng; Tong He; Jiangmiao Pang; Mingmin Chi; Yu Qiao; Kaipeng Zhang

arXiv:2507.17744·cs.CV·July 24, 2025

TL;DR

Yume is an interactive world generation model that creates realistic, dynamic environments from images, text, or videos, enabling exploration and control through various input methods with high fidelity and responsiveness.

Contribution

The paper introduces Yume, a novel framework for interactive world generation combining camera motion quantization, a new video diffusion transformer, and advanced sampling and acceleration techniques.

Findings

01 Achieves high-quality, diverse scene generation

02 Supports keyboard-driven exploration in the preview version; control via neural signals is a stated goal

03 Demonstrates effective real-time exploration and control

Abstract

Yume aims to use images, text, or videos to create an interactive, realistic, and dynamic world, which allows exploration and control using peripheral devices or neural signals. In this report, we present a preview version of Yume, which creates a dynamic world from an input image and allows exploration of the world using keyboard actions. To achieve this high-fidelity and interactive video world generation, we introduce a well-designed framework, which consists of four main components, including camera motion quantization, video generation architecture, advanced sampler, and model acceleration. First, we quantize camera motions for stable training and user-friendly interaction using keyboard inputs. Then, we introduce the Masked Video Diffusion Transformer (MVDT) with a memory module for infinite video generation in an autoregressive manner. After that, training-free Anti-Artifact…
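
The camera motion quantization step mentioned in the abstract can be pictured with a minimal sketch. This is not the authors' method: the excerpt gives no details, so the action set (`W`, `A`, `S`, `D`, `idle`), the coordinate convention (+z forward, +x right), the `min_speed` threshold, and both function names are assumptions made for illustration only. The idea shown is simply mapping each continuous per-frame camera translation to the nearest discrete, keyboard-like action.

```python
import numpy as np

# Hypothetical action vocabulary (assumption): unit directions in camera
# coordinates with +z forward and +x right. The paper's actual action set
# is not given in this excerpt.
ACTIONS = {
    "W": np.array([0.0, 0.0, 1.0]),   # move forward
    "S": np.array([0.0, 0.0, -1.0]),  # move backward
    "A": np.array([-1.0, 0.0, 0.0]),  # move left
    "D": np.array([1.0, 0.0, 0.0]),   # move right
}


def quantize_translation(delta, min_speed=1e-3):
    """Map one continuous translation vector to a discrete action label.

    Returns "idle" when the motion is smaller than min_speed (an assumed
    threshold); otherwise returns the action whose direction has the
    largest cosine similarity with the motion.
    """
    delta = np.asarray(delta, dtype=float)
    norm = np.linalg.norm(delta)
    if norm < min_speed:
        return "idle"
    direction = delta / norm
    return max(ACTIONS, key=lambda name: float(ACTIONS[name] @ direction))


def quantize_trajectory(positions):
    """Convert a sequence of camera positions into per-step action labels."""
    positions = np.asarray(positions, dtype=float)
    deltas = np.diff(positions, axis=0)  # frame-to-frame translations
    return [quantize_translation(d) for d in deltas]


if __name__ == "__main__":
    path = [[0, 0, 0], [0, 0, 1], [1, 0, 1], [1, 0, 1]]
    print(quantize_trajectory(path))
```

A discrete vocabulary like this is what makes keyboard control natural: each key press corresponds to one label, and training data with continuous camera poses can be relabeled with the same vocabulary. A fuller treatment would presumably also quantize camera rotation, which this sketch omits.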

Taxonomy

Topics: Human Motion and Animation · Video Analysis and Summarization