World-consistent Video Diffusion with Explicit 3D Modeling

Qihang Zhang; Shuangfei Zhai; Miguel Angel Bautista; Kevin Miao; Alexander Toshev; Joshua Susskind; Jiatao Gu

arXiv:2412.01821·cs.CV·December 3, 2024

TL;DR

This paper introduces WVD, a diffusion-based framework that explicitly models 3D consistency in video generation by learning joint RGB and XYZ representations, enabling flexible multi-view and camera-controlled synthesis.

Contribution

The paper presents a novel diffusion transformer that incorporates explicit 3D supervision with XYZ images, unifying multiple tasks like 3D generation, multi-view stereo, and camera-driven video synthesis.

Findings

01. Competitive performance on multiple benchmarks

02. Supports multi-task adaptability through a flexible inpainting strategy

03. Enables 3D-consistent video and image generation
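
One way to picture the inpainting strategy is as a per-frame mask that says which modality is given and which must be generated. The sketch below is illustrative only: the task names, the mask layout, and both helper functions are hypothetical, and the blending step is the generic inpainting-style conditioning used with diffusion models, not necessarily the authors' exact procedure.

```python
from typing import Dict, List


def conditioning_mask(task: str, num_frames: int) -> List[Dict[str, bool]]:
    """Per frame: True = modality is given as input, False = model must generate it.

    Task names are hypothetical labels chosen for this sketch.
    """
    if task == "rgb_to_xyz":
        # All RGB frames are observed; XYZ frames are inpainted.
        return [{"rgb": True, "xyz": False} for _ in range(num_frames)]
    if task == "xyz_to_rgb":
        # XYZ frames (e.g. projections along a camera path) are given; RGB is generated.
        return [{"rgb": False, "xyz": True} for _ in range(num_frames)]
    if task == "single_image_to_3d":
        # Only the first RGB frame is observed; everything else is generated.
        return [{"rgb": i == 0, "xyz": False} for i in range(num_frames)]
    if task == "joint_generation":
        # Nothing is given; RGB and XYZ are sampled jointly.
        return [{"rgb": False, "xyz": False} for _ in range(num_frames)]
    raise ValueError(f"unknown task: {task}")


def blend(known: List[float], sample: List[float], given: List[bool]) -> List[float]:
    """Keep known values where given is True, use the model's sample elsewhere."""
    return [k if g else s for k, s, g in zip(known, sample, given)]


if __name__ == "__main__":
    print(conditioning_mask("single_image_to_3d", 3))
    print(blend([1.0, 2.0], [9.0, 9.0], [True, False]))
```

Under this framing, switching tasks changes only the mask, which is why a single model trained on the joint RGB and XYZ distribution could serve several tasks.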

Abstract

Recent advancements in diffusion models have set new benchmarks in image and video generation, enabling realistic visual synthesis across single- and multi-frame contexts. However, these models still struggle with efficiently and explicitly generating 3D-consistent content. To address this, we propose World-consistent Video Diffusion (WVD), a novel framework that incorporates explicit 3D supervision using XYZ images, which encode global 3D coordinates for each image pixel. More specifically, we train a diffusion transformer to learn the joint distribution of RGB and XYZ frames. This approach supports multi-task adaptability via a flexible inpainting strategy. For example, WVD can estimate XYZ frames from ground-truth RGB or generate novel RGB frames using XYZ projections along a specified camera trajectory. In doing so, WVD unifies tasks like single-image-to-3D generation, multi-view…
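
To make the notion of an XYZ image concrete, the following minimal sketch builds one from a depth map, assuming a pinhole camera with known intrinsics and a known camera-to-world pose. The abstract only states that XYZ images encode global 3D coordinates per pixel; the construction from depth, the function name `depth_to_xyz_image`, and all parameter names are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def depth_to_xyz_image(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    cam_to_world: Sequence[Sequence[float]],
) -> List[List[Point]]:
    """Return an H x W grid holding the world-space (X, Y, Z) of every pixel.

    depth        -- H x W depth along the camera z-axis
    fx, fy       -- focal lengths in pixels (pinhole model, assumed)
    cx, cy       -- principal point in pixels
    cam_to_world -- 4 x 4 rigid transform from camera to world coordinates
    """
    xyz: List[List[Point]] = []
    for v, row in enumerate(depth):
        out_row: List[Point] = []
        for u, d in enumerate(row):
            # Back-project pixel (u, v) at depth d into camera coordinates.
            p = ((u - cx) / fx * d, (v - cy) / fy * d, d, 1.0)
            # Move the point into the shared world frame.
            world = tuple(
                sum(cam_to_world[i][k] * p[k] for k in range(4)) for i in range(3)
            )
            out_row.append(world)  # type: ignore[arg-type]
        xyz.append(out_row)
    return xyz


if __name__ == "__main__":
    pose = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    image = depth_to_xyz_image([[2.0, 2.0], [2.0, 2.0]], 1.0, 1.0, 0.5, 0.5, pose)
    print(image[0][0], image[1][1])
```

Because every frame's coordinates live in the same world frame, the same surface point receives the same XYZ value in every view that sees it, which is the property that makes such images a direct 3D-consistency signal.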

Taxonomy

Topics: Advanced Vision and Imaging · Video Coding and Compression Technologies · Image and Signal Denoising Methods

Methods: Sparse Evolutionary Training · Diffusion · Inpainting