DCDM: Divide-and-Conquer Diffusion Models for Consistency-Preserving Video Generation

Haoyu Zhao; Yuang Zhang; Junqi Cheng; Jiaxi Gu; Zenghui Lu; Peng Shu; Zuxuan Wu; Yu-Gang Jiang

arXiv:2602.13637 · cs.CV · February 17, 2026

Open Access

TL;DR

DCDM splits video consistency into three dedicated components on a shared generation backbone, targeting semantic, geometric, and identity coherence through LLM-parsed structured prompts, a noise-space camera representation, and windowed cross-attention.

Contribution

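The paper presents a divide-and-conquer diffusion model that explicitly models three kinds of consistency with dedicated components: intra-clip world-knowledge consistency, inter-clip camera consistency, and inter-shot element consistency, with the aim of improving video coherence.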

Findings

01. Effective intra-clip semantic consistency via LLM-based prompt parsing and a diffusion transformer.

02. Stable camera motion control through a noise-space camera representation.

03. Long-range narrative coherence with windowed cross-attention.
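
The windowed cross-attention named in the third finding can be illustrated with a generic sketch. This is not the paper's implementation: the listing does not describe how DCDM defines its window, so the code assumes queries and keys share one position index and that a query at position i may attend only to keys within `window` positions of i. The function names `window_mask` and `windowed_cross_attention` are hypothetical.

```python
import math


def window_mask(num_queries, num_keys, window):
    """mask[i][j] is True when query i may attend to key j.

    Assumption for illustration: positions are comparable indices and the
    window is symmetric, i.e. |i - j| <= window.
    """
    return [[abs(i - j) <= window for j in range(num_keys)]
            for i in range(num_queries)]


def windowed_cross_attention(queries, keys, values, window):
    """Scaled dot-product cross-attention restricted to a local window.

    queries: list of d-dimensional vectors (lists of floats)
    keys:    list of d-dimensional vectors
    values:  list of vectors, one per key
    Returns one output vector per query.
    """
    d = len(queries[0])
    out_dim = len(values[0])
    mask = window_mask(len(queries), len(keys), window)
    outputs = []
    for i, q in enumerate(queries):
        allowed = [j for j in range(len(keys)) if mask[i][j]]
        if not allowed:
            # No key falls inside the window: emit a zero vector.
            outputs.append([0.0] * out_dim)
            continue
        # Scaled dot-product scores over the allowed keys only.
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
                  for j in allowed]
        # Numerically stable softmax.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Weighted sum of the values inside the window.
        outputs.append([
            sum((w / total) * values[j][k] for w, j in zip(weights, allowed))
            for k in range(out_dim)
        ])
    return outputs
```

With `window=0` each query sees only the key at its own position, so the output reproduces the corresponding value vector; widening the window lets each position blend information from its neighbours while keeping cost proportional to the window size rather than the full sequence length.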

Abstract

Recent video generative models have demonstrated impressive visual fidelity, yet they often struggle with semantic, geometric, and identity consistency. In this paper, we propose a system-level framework, termed the Divide-and-Conquer Diffusion Model (DCDM), to address three key challenges: (1) intra-clip world knowledge consistency, (2) inter-clip camera consistency, and (3) inter-shot element consistency. DCDM decomposes video consistency modeling under these scenarios into three dedicated components while sharing a unified video generation backbone. For intra-clip consistency, DCDM leverages a large language model to parse input prompts into structured semantic representations, which are subsequently translated into coherent video content by a diffusion transformer. For inter-clip camera consistency, we propose a temporal camera representation in the noise space that enables precise…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis