CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion

Dianbing Xi; Jiepeng Wang; Yuanzhi Liang; Xi Qiu; Jialun Liu; Hao Pan; Yuchi Huo; Rui Wang; Haibin Huang; Chi Zhang; Xuelong Li

arXiv:2511.21129·cs.CV·November 27, 2025

Open Access

TL;DR

This paper introduces CtrlVDiff, a unified diffusion model for controllable, high-fidelity video generation and editing. It integrates multiple modalities such as depth, normals, and semantics, and is trained on a large hybrid dataset.

Contribution

The paper proposes a novel hybrid modality control strategy and a large aligned dataset to improve the controllability and robustness of video diffusion models.

Findings

1. Superior controllability in video editing tasks.
2. Enhanced fidelity and temporal coherence.
3. Robust performance with missing modalities.

Abstract

We tackle the dual challenges of video understanding and controllable video generation within a unified diffusion framework. Our key insights are two-fold. First, geometry-only cues (e.g., depth, edges) are insufficient: they specify layout but under-constrain appearance, materials, and illumination, limiting physically meaningful edits such as relighting or material swaps and often causing temporal drift. Second, enriching the model with additional graphics-based modalities (intrinsics and semantics) provides complementary constraints that both disambiguate understanding and enable precise, predictable control during generation. However, building a single model that uses many heterogeneous cues introduces two core difficulties. Architecturally, the model must accept any subset of modalities, remain robust to missing inputs, and inject control signals without sacrificing temporal consistency.…
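The requirement that a model "accept any subset of modalities" and "remain robust to missing inputs" can be illustrated with a small sketch. This is not the paper's implementation, which the abstract excerpt does not describe. It assumes one common approach: pack every modality into a fixed-size slot, zero-fill the missing ones, and pass a presence mask alongside. The modality names, `sample_subset`, and `pack_conditions` are hypothetical and chosen only for illustration.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical modality list, loosely based on cues named on this page.
MODALITIES = ["depth", "edges", "normals", "semantics"]


def sample_subset(rng: random.Random, keep_prob: float = 0.5) -> List[str]:
    """Pick a random subset of modalities to keep for one training sample.

    Assumption: randomly dropping modalities during training is one way to
    expose a model to every possible input combination.
    """
    return [m for m in MODALITIES if rng.random() < keep_prob]


def pack_conditions(
    inputs: Dict[str, List[float]], feature_len: int
) -> Tuple[List[List[float]], List[int]]:
    """Return one feature row per modality plus a 0/1 presence mask.

    Missing modalities become all-zero rows, so the packed shape is the
    same no matter which subset was supplied.
    """
    rows: List[List[float]] = []
    mask: List[int] = []
    for name in MODALITIES:
        feat = inputs.get(name)
        if feat is None:
            rows.append([0.0] * feature_len)  # placeholder for a missing cue
            mask.append(0)
        else:
            if len(feat) != feature_len:
                raise ValueError(f"{name}: expected {feature_len} values")
            rows.append(list(feat))
            mask.append(1)
    return rows, mask


if __name__ == "__main__":
    rng = random.Random(0)
    kept = sample_subset(rng)
    toy_inputs = {name: [1.0, 2.0] for name in kept}
    rows, mask = pack_conditions(toy_inputs, feature_len=2)
    print(kept, mask)
```

The mask lets a downstream network distinguish "this cue is absent" from "this cue is present and happens to be zero", which is the distinction that matters when some control signals are missing.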

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Visual Attention and Saliency Detection