Decoupled Video Generation with Chain of Training-free Diffusion Model Experts

Wenhao Li, Yichao Cao, Xiu Su, Xi Lin, Shan You, Mingkai Zheng, Yi Chen, Chang Xu

arXiv:2408.13423 · cs.CV · December 30, 2024


TL;DR

ConFiner introduces a decoupled, efficient video generation framework using multiple diffusion experts, significantly reducing computational costs while producing high-quality, long, coherent videos.

Contribution

It proposes a novel decoupled approach that combines a chain of training-free diffusion experts with coordinated denoising for efficient, high-quality video synthesis.

Findings

1. Surpasses LaVie and ModelScope in quality at only 10% of the inference cost
2. Generates coherent videos up to 600 frames long
3. Achieves high scores on both subjective and objective metrics

Abstract

Video generation models hold substantial potential in areas such as filmmaking. However, current video diffusion models incur high computational costs and produce suboptimal results because of the extreme complexity of the video generation task. In this paper, we propose ConFiner, an efficient video generation framework that decouples video generation into easier subtasks: structure control ("Con") and spatial-temporal refinement ("Fine"). It can generate high-quality videos with a chain of off-the-shelf diffusion model experts, each responsible for one decoupled subtask. During refinement, we introduce coordinated denoising, which merges the capabilities of multiple diffusion experts into a single sampling process. Furthermore, we design the ConFiner-Long framework, which extends ConFiner with three constraint strategies to generate long, coherent videos. Experimental results indicate that with only…
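The abstract does not spell out how coordinated denoising combines experts, so the following is only a minimal sketch under explicit assumptions, not the authors' method. It assumes the structure-control stage has already produced a coarse video latent (here a hand-made toy array), which is partially re-noised to an intermediate timestep and then refined in one sampling loop where a hypothetical "spatial" expert and a hypothetical "temporal" expert alternate over a shared latent. The cosine schedule, the DDIM update, the even/odd alternation rule, and every name (`spatial`, `temporal_smooth`, `coordinated_denoise`, `t_start`) are illustrative assumptions; the toy experts are simple stand-ins, not real diffusion models.

```python
import math
import random
from typing import Callable, List, Tuple

# Toy video latent: a list of frames, each a list of floats (assumption for illustration).
Video = List[List[float]]
# Hypothetical expert interface: (noisy latent x_t, timestep t) -> predicted noise.
Expert = Callable[[Video, int], Video]

T = 50  # assumed total number of diffusion timesteps


def alpha_bar(t: int) -> float:
    """Assumed cosine schedule: cumulative signal fraction, 1.0 at t=0, near 0 at t=T."""
    return math.cos(0.5 * math.pi * t / (T + 1)) ** 2


def add_noise(x0: Video, t: int, rng: random.Random) -> Video:
    """Forward diffusion q(x_t | x_0): partially re-noise the coarse video to timestep t."""
    a = alpha_bar(t)
    return [[math.sqrt(a) * v + math.sqrt(1 - a) * rng.gauss(0.0, 1.0) for v in frame]
            for frame in x0]


def ddim_step(x_t: Video, eps: Video, t: int) -> Video:
    """Deterministic DDIM update (eta = 0) from timestep t to t - 1."""
    a_t, a_prev = alpha_bar(t), alpha_bar(t - 1)
    out = []
    for fx, fe in zip(x_t, eps):
        x0_hat = [(x - math.sqrt(1 - a_t) * e) / math.sqrt(a_t) for x, e in zip(fx, fe)]
        out.append([math.sqrt(a_prev) * p + math.sqrt(1 - a_prev) * e
                    for p, e in zip(x0_hat, fe)])
    return out


def expert_from_x0(predict_x0: Callable[[Video], Video]) -> Expert:
    """Wrap a toy clean-video predictor as a noise predictor."""
    def expert(x_t: Video, t: int) -> Video:
        a = alpha_bar(t)
        x0 = predict_x0(x_t)
        return [[(x - math.sqrt(a) * p) / math.sqrt(1 - a) for x, p in zip(fx, fp)]
                for fx, fp in zip(x_t, x0)]
    return expert


# Stand-in "spatial expert": clamps each frame independently, like a per-frame image prior.
spatial = expert_from_x0(lambda x: [[max(-1.0, min(1.0, v)) for v in f] for f in x])


def temporal_smooth(x: Video) -> Video:
    """Stand-in "temporal expert" prior: average each frame with its neighbours in time."""
    n = len(x)
    return [[(x[max(i - 1, 0)][j] + x[i][j] + x[min(i + 1, n - 1)][j]) / 3.0
             for j in range(len(x[i]))] for i in range(n)]


temporal = expert_from_x0(temporal_smooth)


def coordinated_denoise(x_t: Video, t_start: int, spatial_expert: Expert,
                        temporal_expert: Expert, log: List[Tuple[int, str]]) -> Video:
    """One sampling loop on a shared latent; even t -> spatial, odd t -> temporal (assumed)."""
    x = x_t
    for t in range(t_start, 0, -1):
        if t % 2 == 0:
            name, expert = "spatial", spatial_expert
        else:
            name, expert = "temporal", temporal_expert
        log.append((t, name))
        x = ddim_step(x, expert(x, t), t)
    return x


# Usage: a hand-made 6-frame x 4-feature "coarse video" stands in for the control stage.
rng = random.Random(0)
coarse = [[0.1 * i + 0.01 * j for j in range(4)] for i in range(6)]
t_start = 20
log: List[Tuple[int, str]] = []
refined = coordinated_denoise(add_noise(coarse, t_start, rng), t_start,
                              spatial, temporal, log)
```

The point of running both experts inside one loop, rather than sampling twice, is that every expert update lands on the same latent trajectory, so spatial and temporal corrections accumulate in a single pass. How ConFiner actually assigns timesteps to experts, or reconciles experts trained with different noise schedules, is not stated in the abstract and may differ from this alternation.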


Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Advanced Vision and Imaging

Methods: Diffusion