DCDM: Divide-and-Conquer Diffusion Models for Consistency-Preserving Video Generation
Haoyu Zhao, Yuang Zhang, Junqi Cheng, Jiaxi Gu, Zenghui Lu, Peng Shu, Zuxuan Wu, Yu-Gang Jiang

TL;DR
DCDM is a unified framework that decomposes video consistency into three dedicated components on a shared generation backbone. It targets semantic, geometric, and identity coherence using LLM-parsed prompts, a temporal camera representation in the noise space, and windowed cross-attention.
Contribution
The paper presents a divide-and-conquer diffusion framework that explicitly models intra-clip, inter-clip, and inter-shot consistency with three dedicated components built on a shared video generation backbone.
Findings
Effective intra-clip semantic consistency via LLM-based prompt parsing and a diffusion transformer.
Stable camera motion control through a temporal camera representation in the noise space.
Long-range narrative coherence with windowed cross-attention.
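As a rough illustration of what windowed cross-attention means in general, the sketch below lets each query position attend only to context positions within a fixed distance. This is not the paper's implementation: the function name windowed_cross_attention, the window parameter, and the symmetric |i - j| <= window rule are all assumptions made for illustration, and the sketch uses plain Python lists instead of a tensor library.

```python
import math


def windowed_cross_attention(queries, keys, values, window):
    """Hypothetical sketch: query i attends only to context j with |i - j| <= window.

    queries: list of query vectors (one per query position)
    keys, values: lists of context vectors (same length as each other)
    window: assumed half-width of the attention window, in positions
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)  # standard scaled dot-product factor
    outputs = []
    for i, q in enumerate(queries):
        lo = max(0, i - window)
        hi = min(len(keys), i + window + 1)
        if lo >= hi:
            raise ValueError(f"query {i} has no context inside its window")
        positions = range(lo, hi)
        # Scaled dot-product scores, restricted to the window.
        scores = [scale * sum(a * b for a, b in zip(q, keys[j])) for j in positions]
        # Numerically stable softmax over the windowed scores.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # Weighted average of the value vectors inside the window.
        out = [
            sum(w * values[j][d] for w, j in zip(weights, positions)) / total
            for d in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs
```

Restricting each query to a local window keeps the cost per query proportional to the window size rather than to the full context length, which is the usual motivation for windowing over long sequences.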
Abstract
Recent video generative models have demonstrated impressive visual fidelity, yet they often struggle with semantic, geometric, and identity consistency. In this paper, we propose a system-level framework, termed the Divide-and-Conquer Diffusion Model (DCDM), to address three key challenges: (1) intra-clip world knowledge consistency, (2) inter-clip camera consistency, and (3) inter-shot element consistency. DCDM decomposes video consistency modeling under these scenarios into three dedicated components while sharing a unified video generation backbone. For intra-clip consistency, DCDM leverages a large language model to parse input prompts into structured semantic representations, which are subsequently translated into coherent video content by a diffusion transformer. For inter-clip camera consistency, we propose a temporal camera representation in the noise space that enables precise…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis
