Dreamix: Video Diffusion Models are General Video Editors
Eyal Molad, Eliahu Horwitz, Dani Valevski, Alex Rav Acha, Yossi Matias, Yael Pritch, Yaniv Leviathan, Yedid Hoshen

TL;DR
Dreamix introduces a novel diffusion-based method for text-driven video editing, enabling high-fidelity, flexible modifications of general videos with improved motion editability and applications in image animation and subject-driven generation.
Contribution
The paper presents the first diffusion-based approach for text-driven video editing, including a new finetuning process, a mixed objective for motion editability, and a framework for image animation.
Findings
Achieves high-fidelity video editing aligned with text prompts
Outperforms baseline methods in qualitative and quantitative evaluations
Enables versatile applications like image animation and subject-driven video generation
Abstract
Text-driven image and video diffusion models have recently achieved unprecedented generation realism. While diffusion models have been successfully applied to image editing, very few works have done so for video editing. We present the first diffusion-based method that is able to perform text-based motion and appearance editing of general videos. Our approach uses a video diffusion model to combine, at inference time, the low-resolution spatio-temporal information from the original video with new, high-resolution information that it synthesizes to align with the guiding text prompt. As obtaining high fidelity to the original video requires retaining some of its high-resolution information, we add a preliminary stage of finetuning the model on the original video, significantly boosting fidelity. We propose to improve motion editability by a new, mixed objective that jointly finetunes…
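One way to picture the inference-time combination the abstract describes is as a corruption step: the original video is reduced to its coarse spatio-temporal content and mixed with noise, and a text-conditioned video diffusion model would then synthesize the missing detail. The pure-Python sketch below covers only that corruption step. The function names, the average-pool and nearest-neighbour resampling, and the single `noise_level` parameter are illustrative assumptions, not details taken from the paper, and the denoising model itself is omitted.

```python
import math
import random


def downsample_frame(frame, factor):
    """Average-pool a 2D frame (list of rows of floats) by an integer factor."""
    h, w = len(frame), len(frame[0])
    out = []
    for y in range(0, h - h % factor, factor):
        row = []
        for x in range(0, w - w % factor, factor):
            block = [frame[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def upsample_frame(frame, factor):
    """Nearest-neighbour upsampling: repeat each value factor x factor times."""
    out = []
    for row in frame:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def corrupt_video(video, factor, noise_level, rng):
    """Keep only coarse structure, then mix in Gaussian noise.

    noise_level is assumed to lie in [0, 1]: 0 returns the blurred video,
    1 returns pure noise (hypothetical parameterisation).
    """
    keep = math.sqrt(1.0 - noise_level)
    mix = math.sqrt(noise_level)
    corrupted = []
    for frame in video:
        coarse = upsample_frame(downsample_frame(frame, factor), factor)
        corrupted.append([[keep * v + mix * rng.gauss(0.0, 1.0) for v in row]
                          for row in coarse])
    return corrupted


# Toy usage: a 2-frame, 4x4 grayscale "video".
frame = [[0, 2, 4, 6], [0, 2, 4, 6], [8, 8, 8, 8], [8, 8, 8, 8]]
degraded = corrupt_video([frame, frame], factor=2, noise_level=0.5,
                         rng=random.Random(0))
```

In this sketch, a small `noise_level` keeps the result close to the blurred input, which favours fidelity, while a larger value leaves more for the model to re-synthesize from the text prompt.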
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
1. The model can accomplish multiple video editing tasks and shows strong visual results. 2. The fine-tuning strategy enhances the editing capabilities of the VDM. 3. The paper is well structured and clearly explains its core ideas.
1. This paper conducted editing experiments on a single VDM base model, making it difficult to ascertain whether the proposed method is applicable to other VDMs (e.g., ModelScope) or whether the observed results are solely due to the characteristics of the base model. This is somewhat inconsistent with the title of the paper. 2. The comparisons with Tune-A-Video are not fair. Tune-A-Video is finetuned on an image diffusion model, whereas Dreamix is finetuned on a video diffusion model. 3. When compared wit
1. The proposed method can achieve various video editing and generation applications. 2. Extensive ablation studies are conducted to demonstrate the effectiveness of each component.
1. The comparison with previous works is not comprehensive. For text-based video editing, FateZero [ref-1] and TokenFlow [ref-2] are more advanced methods designed for video editing. For subject-driven video generation, AnimateDiff [ref-3] can learn a LoRA model from several images and generate corresponding videos. For animating a single image, VideoComposer [ref-4] can generate video conditioned on the single image and text, which does not require additional transformation. These methods shou
1. The proposed framework can be applied to multiple tasks such as Video Editing, Image-driven Videos and Subject-driven Video Generation. 2. The Mixed Video-Image Finetuning is sound and achieves reasonable performance. 3. The paper presents extensive experiments that demonstrate the ability of Dreamix.
1. Although the Mixed Video-Image Finetuning strategy is sound, the overall technical novelty is somewhat limited. It is an application of VDMs with a sophisticated finetuning strategy. Also, it would be interesting to see how much the base VDM affects the finetuning results. 2. Although the paper claims high fidelity and quality for video editing, the resolutions are still low, with blurred details. It looks like the input video/image provides layout information and the details
Motion editing is a challenging and important topic. The authors propose a systematic solution for text-driven motion editing based on video diffusion models, which has a positive impact on the entire community. The authors present a good number of experiments validating the effectiveness of their approach and demonstrate excellent performance.
1. Limited fidelity to original videos. Ensuring fidelity to the original video is a thorny issue. Also, in the presentation: **Spatially** The background has become smooth and appears to be peeling (ref. Figure 7). The authors need to explain what efforts they have made to address this issue. **Temporally** Lack of integration of action semantics and temporal modeling, which leads to the incongruity of target motions, *e.g.,* the transition from eating to dancing in Figure 1. 2. T
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis
Methods: ALIGN · Dreamix: video diffusion models are general video editors · Diffusion
