I4VGen: Image as Free Stepping Stone for Text-to-Video Generation
Xiefan Guo, Jinlin Liu, Miaomiao Cui, Liefeng Bo, Di Huang

TL;DR
I4VGen introduces a two-stage inference pipeline that leverages advanced image techniques to enhance pre-trained text-to-video diffusion models, significantly improving video realism and fidelity without additional training.
Contribution
The paper presents I4VGen, a novel inference method that enhances existing text-to-video models using anchor image synthesis and a new noise-invariant video score distillation sampling (NI-VSDS) technique, without extra training.
Findings
Produces more realistic and faithful videos
Supports seamless integration with existing models
Improves video quality significantly
Abstract
Text-to-video generation has trailed behind text-to-image generation in terms of quality and diversity, primarily due to the inherent complexities of spatio-temporal modeling and the limited availability of video-text datasets. Recent text-to-video diffusion models employ the image as an intermediate step, significantly enhancing overall performance but incurring high training costs. In this paper, we present I4VGen, a novel video diffusion inference pipeline to leverage advanced image techniques to enhance pre-trained text-to-video diffusion models, which requires no additional training. Instead of the vanilla text-to-video inference pipeline, I4VGen consists of two stages: anchor image synthesis and anchor image-augmented text-to-video synthesis. Correspondingly, a simple yet effective generation-selection strategy is employed to achieve visually-realistic and semantically-faithful…
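The generation-selection idea mentioned in the abstract can be illustrated with a minimal sketch: sample several candidate anchor images for a prompt, score each one, and keep the best. This is only an illustration under stated assumptions, not the authors' implementation. The names `select_anchor`, `toy_generate`, and `toy_score` are hypothetical, and the toy generator and scorer stand in for a real text-to-image model and a real image-quality or text-alignment scorer, neither of which is specified in this summary.

```python
import random
from typing import Callable, List, Tuple

# Placeholder image type: a flat list of floats stands in for real pixel data.
Image = List[float]


def select_anchor(
    prompt: str,
    generate: Callable[[str, int], Image],
    score: Callable[[str, Image], float],
    num_candidates: int = 4,
    seed: int = 0,
) -> Tuple[Image, float]:
    """Generate `num_candidates` images and return the highest-scoring one.

    `generate` and `score` are hypothetical callables supplied by the caller;
    this summary does not say which models the paper uses for them.
    """
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")

    best_image: Image = []
    best_score = float("-inf")
    for i in range(num_candidates):
        # A distinct seed per candidate gives a distinct sample.
        candidate = generate(prompt, seed + i)
        candidate_score = score(prompt, candidate)
        if candidate_score > best_score:
            best_image, best_score = candidate, candidate_score
    return best_image, best_score


def toy_generate(prompt: str, seed: int) -> Image:
    """Toy stand-in for a text-to-image model (assumption, ignores the prompt)."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(8)]


def toy_score(prompt: str, image: Image) -> float:
    """Toy stand-in for a quality scorer (assumption): mean pixel value."""
    return sum(image) / len(image)


if __name__ == "__main__":
    anchor, anchor_score = select_anchor(
        "a cat surfing", toy_generate, toy_score, num_candidates=5
    )
    print(f"selected anchor score: {anchor_score:.3f}")
```

In a real pipeline the selected anchor image would then be passed to the second stage, anchor image-augmented text-to-video synthesis, which this sketch does not cover.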
Peer Reviews
Decision·Submitted to ICLR 2025
+ Two-stage text-to-video generation method, I4VGen.
+ Can be integrated into existing image-to-video diffusion models.
- The integration with existing image-to-video diffusion models is interesting, but the authors should combine the method with more I2V models, especially recent ones.
- More ablation studies are required to show whether the anchor image selection and NI-VSDS are optimal.
- At the bottom of Fig. 6, although image quality is better, the motion produced by the proposed method appears smaller than that of SparseCtrl. More experiments are suggested to assess this aspect.
1. The proposed method uses a pre-trained image generation model to improve frame quality in text-to-video generation, which is helpful for high-quality video generation.
2. The presented results demonstrate good quality.
1. The proposed method appears to integrate the T2I model with SDS distillation for video generation, and the contribution seems incremental.
2. The motion observed in Fig. 1 appears to be smaller compared to the baselines.
3. There is a lack of analysis for different regeneration steps.
1. This paper introduces a training-free pipeline called I4VGen to improve the performance of text-to-video diffusion models through image reference information.
2. A simple yet effective generation-selection strategy is proposed to obtain high-quality images, while a noise-invariant video score distillation sampling is introduced for image animation.
3. Extensive experiments show that the proposed method considerably outperforms video diffusion baselines in terms of video qu
1. The technical contributions of the paper are somewhat limited. The proposed noise-invariant video score distillation only modifies some hyper-parameters of the original SDS technique.
2. Compared to the baseline results, videos enhanced with the proposed method show minimal motion or are essentially stationary. The metrics in Table 1 also show that the proposed method heavily harms the dynamic degree of generated videos.
3. AnimateDiff relies on high-quality LoRAs to improve the
Taxonomy
Topics: Digital Humanities and Scholarship · Video Analysis and Summarization · Human Motion and Animation
Methods: Diffusion
