InstructVid2Vid: Controllable Video Editing with Natural Language Instructions

Bosheng Qin; Juncheng Li; Siliang Tang; Tat-Seng Chua; Yueting Zhuang

arXiv:2305.12328·cs.CV·May 30, 2024·2 cites

TL;DR

InstructVid2Vid is a diffusion-based video editing method that applies edits described in natural language without per-example fine-tuning or inversion, producing temporally coherent and diverse videos efficiently.

Contribution

The paper introduces an end-to-end diffusion approach to controllable video editing guided by natural language instructions. It is trained on a synthesized dataset of video-instruction triplets and uses an Inter-Frames Consistency Loss to improve coherence between successive frames.

Findings

01. Produces high-quality, temporally coherent videos

02. Enables diverse edits such as attribute editing, background changes, and style transfer

03. Eliminates the need for per-example fine-tuning

Abstract

We introduce InstructVid2Vid, an end-to-end diffusion-based methodology for video editing guided by human language instructions. Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion. The proposed InstructVid2Vid model modifies a pretrained image generation model, Stable Diffusion, to generate a time-dependent sequence of video frames. By harnessing the collective intelligence of disparate models, we engineer a training dataset rich in video-instruction triplets, which is a more cost-efficient alternative to collecting data in real-world scenarios. To enhance the coherence between successive frames within the generated videos, we propose the Inter-Frames Consistency Loss and incorporate it during the training process. With multimodal classifier-free guidance during the inference stage, the generated…
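
The abstract mentions multimodal classifier-free guidance at inference but does not give its formula here. As a rough illustration only, the sketch below uses the two-condition formulation popularized by InstructPix2Pix, which is an assumption and may differ from what the paper actually does. The function name `multimodal_cfg`, its parameters, and the use of flat lists of floats in place of noise-prediction tensors are all hypothetical.

```python
from typing import List, Sequence


def multimodal_cfg(
    eps_uncond: Sequence[float],
    eps_video: Sequence[float],
    eps_full: Sequence[float],
    scale_video: float,
    scale_text: float,
) -> List[float]:
    """Combine three noise predictions with two guidance scales.

    eps_uncond: prediction with neither the input video nor the instruction
    eps_video:  prediction conditioned on the input video only
    eps_full:   prediction conditioned on the input video and the instruction

    Assumed formulation (borrowed from InstructPix2Pix, not confirmed
    by the abstract):
        eps = eps_uncond
              + scale_video * (eps_video - eps_uncond)
              + scale_text  * (eps_full  - eps_video)
    """
    if not (len(eps_uncond) == len(eps_video) == len(eps_full)):
        raise ValueError("all predictions must have the same length")
    return [
        u + scale_video * (v - u) + scale_text * (f - v)
        for u, v, f in zip(eps_uncond, eps_video, eps_full)
    ]


if __name__ == "__main__":
    # Toy one-element "tensors"; real predictions would be latent-sized arrays.
    print(multimodal_cfg([0.0], [1.0], [2.0], scale_video=1.5, scale_text=7.5))
```

With both scales set to 1 the combination reduces to the fully conditioned prediction. Under this assumed formulation, raising `scale_video` pushes the output toward the source video, while raising `scale_text` pushes it toward the instruction.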

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Multimodal Machine Learning Applications

Methods: Max Pooling · Concatenated Skip Connection · Diffusion · Convolution · BLIP: Bootstrapping Language-Image Pre-training · U-Net