Vivid-ZOO: Multi-View Video Generation with Diffusion Model

Bing Li; Cheng Zheng; Wenxuan Zhu; Jinjie Mai; Biao Zhang; Peter Wonka; Bernard Ghanem

arXiv:2406.08659·cs.CV·June 14, 2024·1 cites

TL;DR

Vivid-ZOO introduces a diffusion-based method for generating high-quality, multi-view videos from text by leveraging pre-trained models and a novel factorization approach to ensure multi-view consistency and temporal coherence.

Contribution

The paper presents a new diffusion pipeline that combines pre-trained multi-view image and 2D video models for text-to-multi-view-video generation, reducing training costs and addressing domain gaps.

Findings

01. Generates high-quality multi-view videos with vivid motions.

02. Achieves multi-view consistency and temporal coherence.

03. Demonstrates effectiveness across diverse text prompts.

Abstract

While diffusion models have shown impressive performance in 2D image/video generation, diffusion-based Text-to-Multi-view-Video (T2MVid) generation remains underexplored. The new challenges posed by T2MVid generation lie in the lack of massive captioned multi-view videos and the complexity of modeling such multi-dimensional distribution. To this end, we propose a novel diffusion-based pipeline that generates high-quality multi-view videos centered around a dynamic 3D object from text. Specifically, we factor the T2MVid problem into viewpoint-space and time components. Such factorization allows us to combine and reuse layers of advanced pre-trained multi-view image and 2D video diffusion models to ensure multi-view consistency as well as temporal coherence for the generated multi-view videos, largely reducing the training cost. We further introduce alignment modules to align the latent…
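
The viewpoint-space/time factorization described in the abstract can be illustrated with a toy sketch. This is not the paper's architecture: the function names (`mix`, `multiview_layer`, `temporal_layer`, `factorized_block`) are hypothetical, mean-mixing stands in for the pre-trained attention layers, and the alignment modules are omitted. It only shows the idea of alternating a step that couples views at a fixed frame with a step that couples frames at a fixed view.

```python
from typing import List

Latent = List[float]          # one toy latent vector
Grid = List[List[Latent]]     # grid[v][t]: latent for view v at frame t


def mix(group: List[Latent], weight: float) -> List[Latent]:
    """Pull each latent toward the group mean (toy stand-in for attention)."""
    dim = len(group[0])
    mean = [sum(x[d] for x in group) / len(group) for d in range(dim)]
    return [[(1 - weight) * x[d] + weight * mean[d] for d in range(dim)]
            for x in group]


def multiview_layer(grid: Grid, weight: float = 0.5) -> Grid:
    """Couple latents across views, independently for each frame index."""
    n_views, n_frames = len(grid), len(grid[0])
    out: Grid = [[[] for _ in range(n_frames)] for _ in range(n_views)]
    for t in range(n_frames):
        mixed = mix([grid[v][t] for v in range(n_views)], weight)
        for v in range(n_views):
            out[v][t] = mixed[v]
    return out


def temporal_layer(grid: Grid, weight: float = 0.5) -> Grid:
    """Couple latents across frames, independently for each view."""
    return [mix(frames, weight) for frames in grid]


def factorized_block(grid: Grid) -> Grid:
    """One factorized step: viewpoint-space mixing, then time mixing."""
    return temporal_layer(multiview_layer(grid))


# Assumed toy input: 2 views x 2 frames, 1-dimensional latents.
toy = [[[0.0], [2.0]],
       [[4.0], [6.0]]]
result = factorized_block(toy)
```

Because each step operates along only one axis of the view-by-frame grid, layers trained separately for multi-view images and for 2D video could in principle be slotted into the corresponding step, which is the reuse the abstract refers to.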

Taxonomy

Topics: Advanced Vision and Imaging · Computer Graphics and Visualization Techniques · Video Analysis and Summarization

Methods: ALIGN · Diffusion