Controllable Video Generation: A Survey

Yue Ma; Kunyu Feng; Zhongyuan Hu; Xinyu Wang; Yucheng Wang; Mingzhe Zheng; Bingyuan Wang; Qinghe Wang; Xuanhua He; Hongfa Wang; Chenyang Zhu; Hongyu Liu; Yingqing He; Zeyu Wang; Zhifeng Li; Xiu Li; Sirui Han; Yike Guo; Wei Liu; Dan Xu; Linfeng Zhang; Qifeng Chen

arXiv:2507.16869 · cs.GR · January 21, 2026

Open Access

TL;DR

This survey reviews recent advances in controllable video generation, emphasizing methods that incorporate multi-modal conditions to improve user control over AI-generated videos.

Contribution

It provides a comprehensive overview of theoretical foundations, control mechanisms, and categorization of controllable video generation methods, highlighting recent progress and open challenges.

Findings

01 Integration of non-textual conditions enhances control in video synthesis.

02 Categorization of methods based on control signals used.

03 Analysis of control mechanisms in diffusion-based models.

Abstract

With the rapid development of AI-generated content (AIGC), video generation has emerged as one of its most dynamic and impactful subfields. In particular, the advancement of video generation foundation models has led to growing demand for controllable video generation methods that can more accurately reflect user intent. Most existing foundation models are designed for text-to-video generation, where text prompts alone are often insufficient to express complex, multi-modal, and fine-grained user requirements. This limitation makes it challenging for users to generate videos with precise control using current models. To address this issue, recent research has explored the integration of additional non-textual conditions, such as camera motion, depth maps, and human pose, to extend pretrained video generation models and enable more controllable video synthesis. These approaches aim to…

Taxonomy

Topics

Advanced Vision and Imaging · Video Coding and Compression Technologies · Video Analysis and Summarization