VEnhancer: Generative Space-Time Enhancement for Video Generation
Jingwen He, Tianfan Xue, Dongyang Liu, Xinqi Lin, Peng Gao, Dahua Lin, Yu Qiao, Wanli Ouyang, Ziwei Liu

TL;DR
VEnhancer is a unified generative framework that enhances AI-generated videos by increasing spatial and temporal resolution, removing artifacts, and improving detail, leading to state-of-the-art results in video super-resolution and generation benchmarks.
Contribution
VEnhancer introduces a novel space-time enhancement method using a pretrained video diffusion model and a video ControlNet, enabling arbitrary up-sampling and artifact removal in generated videos.
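The ControlNet-style conditioning mentioned above can be illustrated with a small sketch. This is the generic ControlNet pattern (a frozen pretrained block, a trainable copy that also sees the condition, and a zero-initialised projection whose output is summed into the frozen features), not VEnhancer's actual architecture. The toy `block` function, the dimensions, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def block(x, w):
    """Toy stand-in for one network block: a linear map followed by tanh."""
    return np.tanh(x @ w)

dim = 8
w_frozen = rng.normal(size=(dim, dim))  # pretrained weights, kept fixed
w_control = w_frozen.copy()             # trainable copy, initialised from the frozen weights
w_zero = np.zeros((dim, dim))           # zero-initialised projection (trainable)

def controlled_forward(x, cond, w_proj):
    """Frozen features plus the projected output of the control branch."""
    base = block(x, w_frozen)             # frozen path, sees only x
    ctrl = block(x + cond, w_control)     # control path, sees x and the condition
    return base + ctrl @ w_proj           # feature summation

x = rng.normal(size=(4, dim))      # hypothetical latent features
cond = rng.normal(size=(4, dim))   # hypothetical condition features

base_only = block(x, w_frozen)
out_init = controlled_forward(x, cond, w_zero)

# Once the projection has moved away from zero, the condition changes the output.
w_trained = 0.1 * rng.normal(size=(dim, dim))
out_trained = controlled_forward(x, cond, w_trained)
```

Because the projection starts at zero, the combined model initially reproduces the pretrained model exactly, so training the control branch begins from the pretrained behaviour rather than disturbing it.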
Findings
Surpasses existing video super-resolution methods
Achieves top performance on VBench benchmark
Effectively removes spatial artifacts and temporal flickering
Abstract
We present VEnhancer, a generative space-time enhancement framework that improves the existing text-to-video results by adding more details in spatial domain and synthetic detailed motion in temporal domain. Given a generated low-quality video, our approach can increase its spatial and temporal resolution simultaneously with arbitrary up-sampling space and time scales through a unified video diffusion model. Furthermore, VEnhancer effectively removes generated spatial artifacts and temporal flickering of generated videos. To achieve this, based on a pretrained video diffusion model, we train a video ControlNet and inject it into the diffusion model as a condition on low frame-rate and low-resolution videos. To effectively train this video ControlNet, we design space-time data augmentation as well as video-aware conditioning. Benefiting from the above designs, VEnhancer yields to be…
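As a rough illustration of what a "low frame-rate and low-resolution" condition looks like, the sketch below degrades a video along both time and space. It is an assumption-laden toy: frame skipping with a fixed stride and average pooling stand in for whatever sampling and downscaling the authors actually use, and the function name `make_condition` and its parameters are hypothetical.

```python
import numpy as np

def make_condition(video, frame_stride, down_factor):
    """Build a low-frame-rate, low-resolution version of `video`.

    video: array of shape (frames, height, width, channels).
    frame_stride: keep every `frame_stride`-th frame (temporal down-sampling).
    down_factor: integer spatial down-scaling factor (average pooling).
    """
    keyframes = video[::frame_stride]                 # frames 0, stride, 2*stride, ...
    k, h, w, c = keyframes.shape
    h2, w2 = h // down_factor, w // down_factor
    # Crop so height and width are divisible by the factor, then average each block.
    cropped = keyframes[:, :h2 * down_factor, :w2 * down_factor]
    blocks = cropped.reshape(k, h2, down_factor, w2, down_factor, c)
    return blocks.mean(axis=(2, 4))

video = np.random.default_rng(0).random((16, 64, 64, 3))
cond = make_condition(video, frame_stride=4, down_factor=4)
print(cond.shape)
```

Varying `frame_stride` and `down_factor` per training sample is one plausible way to expose a model to many up-sampling scales; whether VEnhancer samples them this way is not stated in this summary.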
Peer Reviews
Decision · Submitted to ICLR 2025
1. Novelty: The paper presents a unified approach for generative spatial and temporal super-resolution, which is novel in the field of video generation. The integration of a pretrained generative video prior with a ST-Controller for conditioning is a creative solution that addresses the limitations of cascaded models. The concept of space-time data augmentation and video-aware conditioning is innovative and contributes to the training of the ST-Controller in an end-to-end manner. 2. Quality: Th
1. Some expressions in the paper are not clear and rigorous enough, and there are certain ambiguities or even errors. Below are the instances: - Wrong notation: the three-part superscript on I is not in the form typically used to denote a sequence starting from 1, ending at a given index, with a given step size. Similar cases for z, t, \sigma, and s. Besides, in Fig. 3, z^{1:m}_t, t^{1:m} should be z^{1:f}_t, t^{1:f}. - Wrong illustrations. In Fig. 2, Space-Time Data Augmentation part, both the noised videos (with
- Handles multiple up-scaling in both space and time dimensions, allowing for versatile application scenarios. - Improves the fidelity of generated videos while maintaining or enhancing detail, demonstrated through rigorous testing against current top methods.
- As with most diffusion models, the complexity of the model could lead to longer inference times, which may limit its applicability in real-time or low-resource scenarios. - The model's performance heavily relies on the availability of high-quality training data, which might not always be available or feasible to collect in certain domains. - For each different text-to-video model, a new VEnhancer needs to be trained to accommodate the different architecture, limiting the use case of the propos
1. The proposed approach unifies the space-time video super-resolution through re-sampling operation over time-space axes, achieving multi-function in a single video diffusion model. 2. Both of the objective video quality evaluation and subjective human evaluation show the effectiveness of the proposed approach in video enhancement. 3. The paper is well-written and technical description is clear.
1. One of my key concerns about the technical novelty is the involvement of the UNet-based control branch whose weights are copied from the original I2VGen-XL. The novelty seems limited since the controlling approach has been proposed by ControlNet and the key frame information integration via feature summation is also intuitive. 2. The technical design of space-time data augmentation should be investigated quantitatively in the main paper. Besides, the motivation of the position embedding e
Taxonomy
Topics: Advanced Vision and Imaging · Computer Graphics and Visualization Techniques · Generative Adversarial Networks and Image Synthesis
Methods: Diffusion
