ControlVideo: Training-free Controllable Text-to-Video Generation
Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, Qi Tian

TL;DR
ControlVideo is a training-free framework that enables efficient, coherent, and long text-to-video generation by leveraging structural cues and novel modules to reduce flicker and appearance inconsistency.
Contribution
It introduces a training-free, modular approach to controllable text-to-video synthesis that improves coherence and reduces flicker without any additional training.
Findings
Outperforms state-of-the-art methods quantitatively and qualitatively.
Generates both short and long videos within minutes on a single GPU.
Effectively maintains appearance consistency and structural stability.
Abstract
Text-driven diffusion models have unlocked unprecedented abilities in image generation, whereas their video counterparts still lag behind due to the excessive training cost of temporal modeling. Besides the training burden, the generated videos also suffer from appearance inconsistency and structural flicker, especially in long video synthesis. To address these challenges, we design a training-free framework called ControlVideo to enable natural and efficient text-to-video generation. ControlVideo, adapted from ControlNet, leverages the coarse structural consistency of input motion sequences and introduces three modules to improve video generation. Firstly, to ensure appearance coherence between frames, ControlVideo adds fully cross-frame interaction in self-attention modules. Secondly, to mitigate the flicker effect, it introduces an interleaved-frame smoother that…
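The fully cross-frame interaction mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, tensor shapes, and random inputs are assumptions made for illustration, and in a real diffusion model the queries, keys, and values would come from the pre-trained self-attention projections. The sketch only shows the core idea, which is that tokens from all frames are pooled so that every frame attends to every other frame, in contrast to ordinary per-frame self-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def per_frame_attention(q, k, v):
    """Baseline: q, k, v have shape (frames, tokens, dim).
    Each frame attends only to its own tokens."""
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)  # (F, N, N)
    return softmax(scores) @ v

def full_cross_frame_attention(q, k, v):
    """Hypothetical sketch of fully cross-frame interaction:
    tokens of all frames are flattened into one sequence, so every
    token attends to the tokens of every frame."""
    f, n, d = q.shape
    q_all = q.reshape(f * n, d)
    k_all = k.reshape(f * n, d)
    v_all = v.reshape(f * n, d)
    scores = q_all @ k_all.T / np.sqrt(d)  # (F*N, F*N)
    out = softmax(scores) @ v_all
    return out.reshape(f, n, d)

# Random placeholders standing in for projected features (assumption).
rng = np.random.default_rng(0)
frames, tokens, dim = 4, 6, 8
q = rng.standard_normal((frames, tokens, dim))
k = rng.standard_normal((frames, tokens, dim))
v = rng.standard_normal((frames, tokens, dim))

out_full = full_cross_frame_attention(q, k, v)
out_single = per_frame_attention(q, k, v)
```

Note the cost implied by the shapes in the comments: per-frame attention builds F score matrices of size N×N, while the pooled variant builds a single (F·N)×(F·N) matrix, roughly F times more entries. This is consistent with the reviewers' remark below that full attention increases computation time.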
Peer Reviews
Decision: ICLR 2024 poster
• The proposed method is straightforward, easily implementable, and reproducible, making it accessible for further research and application.
• The paper introduces novel techniques for long video generation, and the "interleaved-frame smoother" effectively improves frame consistency.
• The results demonstrate improvements over existing methods, substantiating the paper's claims.
• While the full-attention mechanism and "interleaved-frame smoother" enhance frame consistency, they also significantly increase the computational time.
• The background appears to flicker in relation to the foreground in some examples. For instance, in the "James Bond moonwalk on the beach, animation style" video on the provided website, the moon inconsistently appears and disappears.
• The paper lacks quantitative comparisons with Text2Video-Zero in the context of pose conditions, which cou
- The paper is clearly written, well organized, and easy to follow. The symbols, terms, and concepts are adequately defined and explained. The language usage is good.
- The proposed method is simple and easy to understand. Sufficient details are provided for the readers.
- The experiments are generally well-executed. The empirical results show the effectiveness of the proposed method, showing certain advantages over state-of-the-art baselines.
- The qualitative results showcase certain advantages of the proposed method over state-of-the-art baselines in controllable text-to-video generation. However, by checking the provided video results, the temporal consistency can still be improved. Also, in some cases, the background looks unchanged. Some visual details can still be improved. Providing more discussions on these could strengthen this paper further.
- The fully cross-frame interaction mechanism considers all frames as the referenc
- The writing is clear and easy to follow.
- It is a training-free method, not relying on large-scale training, and has low computational resource requirements.
- The ablation experiments are well-designed and easy to understand.
- Overall, the innovation is average; applying ControlNet to video editing or generation is straightforward and easily thought of.
- The experiments are not comprehensive; there are too few baseline comparisons, and the experimental validation is limited to just over 20 examples, making the results less convincing.
- Limited by the absence of structure condition, this method can mainly edit videos with similar motion. Its effectiveness diminishes for videos with different motions or poses.
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques · Human Motion and Animation
Methods: Diffusion · Contrastive Language-Image Pre-training
