Controllable Video Generation with Provable Disentanglement
Yifan Shen, Peiyuan Zhu, Zijian Li, Shaoan Xie, Namrata Deka, Zongfang Liu, Zeyu Tang, Guangyi Chen, Kun Zhang

TL;DR
This paper introduces CoVoGAN, a novel framework for controllable video generation that disentangles static and dynamic concepts, providing precise control and theoretical guarantees of identifiability, validated through extensive experiments.
Contribution
We propose a theoretically grounded method for disentangling video concepts, enabling independent control over static and dynamic features in generated videos.
Findings
Improved control precision in video generation.
Theoretical proof of latent variable identifiability.
Enhanced generation quality demonstrated on benchmarks.
Abstract
Controllable video generation remains a significant challenge, despite recent advances in generating high-quality and consistent videos. Most existing methods for controlling video generation treat the video as a whole, neglecting intricate fine-grained spatiotemporal relationships, which limits both control precision and efficiency. In this paper, we propose Controllable Video Generative Adversarial Networks (CoVoGAN) to disentangle the video concepts, thus facilitating efficient and independent control over individual concepts. Specifically, following the minimal change principle, we first disentangle static and dynamic latent variables. We then leverage the sufficient change property to achieve component-wise identifiability of dynamic latent variables, enabling disentangled control of video generation. To establish the theoretical foundation, we provide a rigorous analysis…
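The static/dynamic split described in the abstract can be illustrated with a minimal sketch. This is not the authors' architecture: the function name `sample_video_latents`, the dimensions, and the fixed random transition weights are hypothetical stand-ins for the learned modules. It only shows the structural idea that a static latent is drawn once per video, while each dynamic component evolves from the previous dynamic state plus its own independent noise.

```python
import math
import random


def sample_video_latents(num_frames, static_dim, dynamic_dim, seed=0):
    """Return one (static, dynamic) latent pair per frame.

    Hypothetical illustration only; not the CoVoGAN implementation.
    """
    rng = random.Random(seed)
    # Static block: sampled once and shared by every frame of the video.
    z_static = [rng.gauss(0.0, 1.0) for _ in range(static_dim)]
    # Assumed fixed transition weights, standing in for a learned recurrent module.
    weights = [
        [rng.gauss(0.0, 0.5) for _ in range(dynamic_dim)]
        for _ in range(dynamic_dim)
    ]
    z_dyn = [0.0] * dynamic_dim
    frames = []
    for _ in range(num_frames):
        # Each component depends on the previous dynamic state and on its own
        # noise term, so components are conditionally independent given history.
        z_dyn = [
            math.tanh(sum(w * z for w, z in zip(row, z_dyn)))
            + 0.1 * rng.gauss(0.0, 1.0)
            for row in weights
        ]
        frames.append((z_static, list(z_dyn)))
    return frames


frames = sample_video_latents(num_frames=5, static_dim=3, dynamic_dim=2)
```

In a setup like this, editing one dynamic component while holding the static block fixed would correspond to changing a motion concept without altering content.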
Peer Reviews
Decision: ICLR 2026 Poster
The paper provides a strong theoretical foundation by proving block-wise and component-wise identifiability under mild assumptions. This formalizes why disentanglement between motion and content is achievable, addressing gaps in prior heuristic-based methods. The TTM effectively combines GRU for temporal dependency modeling and Deep Sigmoid Flow for conditional independence, ensuring dynamic variables are disentangled and controllable. Ablation studies validate the critical role of GRU and flow
While CoVoGAN excels in unsupervised disentanglement, it does not address how to align latent dimensions with semantic labels (e.g., explicitly mapping a dimension to "eye blinking"). Integrating weak supervision or text guidance could enhance interpretability and usability. Additionally, the paper lacks experiments on the UCF101 dataset; adding them is recommended.
1. The model successfully achieves both block-wise identifiability (separating content from motion) and component-wise identifiability (separating distinct factors within motion). 2. CoVoGAN achieves the best performance across disentanglement metrics (MCC, SAP, Modularity) compared to strong baselines like StyleGAN-V, MoStGAN-V, LVDM, and Latte. 3. CoVoGAN is computationally efficient, having a generator parameter count (24.98M) comparable to StyleGAN2-ADA and significantly smaller than other
1. The primary implementation and theoretical guarantees are grounded in the GAN framework. How can the idea be integrated into current diffusion-based models? Compared to diffusion models, GAN-based approaches still have a large visual fidelity gap. 2. Experiments focus only on domain-specific datasets. How would CoVoGAN perform on UCF101 or on even larger datasets such as Kinetics? It would be better if the authors could show the model's abilities on more complex datasets. 3. Can CoVoGAN
1. This paper provides a strong theoretical foundation. It provides rigorous identifiability theorems for video generation, offering mathematical guarantees for disentanglement at both block-wise and component-wise levels. 2. This paper enables independent manipulation of specific motion concepts (e.g., eye blinking, head movement) without affecting content/identity, demonstrated across multiple datasets. 3. The evaluation is thorough across different metrics and datasets.
1. The paper makes a strong assumption: videos decompose cleanly into static content and dynamic motion. However, many real-world videos don't fit this dichotomy. For example, deformable objects have constant but moving parts, so identity and motion are entangled. Lighting changes and moving shadows do not fit into these two categories either. I am concerned about the limited generalizability of the proposed method to broader video domains. 2. Performance on complex human motion (
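The reviews above cite MCC among the disentanglement metrics. The paper's exact evaluation protocol is not given in this text, so the following is a sketch of the commonly used definition: the mean absolute Pearson correlation between true and estimated latent components under the best one-to-one matching. The function names are hypothetical, and the brute-force permutation search is an assumption that is only practical for a small number of components.

```python
from itertools import permutations
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length sample lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def mcc(true_latents, est_latents):
    """Mean correlation coefficient under the best component matching.

    Each argument is a list of components; each component is a list of samples.
    """
    d = len(true_latents)
    # Absolute correlation, since a sign flip still counts as recovered.
    corr = [[abs(pearson(t, e)) for e in est_latents] for t in true_latents]
    # Exhaustive search over matchings (assumed small d).
    best = max(
        sum(corr[i][p[i]] for i in range(d)) for p in permutations(range(d))
    )
    return best / d


# Estimated components are a permuted, rescaled, sign-flipped copy of the truth.
true = [[1, 2, 3, 4], [1, -1, 1, -1]]
est = [[-2, 2, -2, 2], [2, 4, 6, 8]]
score = mcc(true, est)
```

A score near 1 indicates that each true component is recovered up to permutation and an invertible component-wise rescaling, which is what component-wise identifiability aims for.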
Taxonomy
Topics: Chaos-based Image/Signal Encryption · Advanced Optical Imaging Technologies
