Towards Language-Driven Video Inpainting via Multimodal Large Language Models

Jianzong Wu; Xiangtai Li; Chenyang Si; Shangchen Zhou; Jingkang Yang; Jiangning Zhang; Yining Li; Kai Chen; Yunhai Tong; Ziwei Liu; Chen Change Loy

arXiv:2401.10226 · cs.CV · October 2, 2024 · 2 cites

Open Access · 1 Repo · 2 Models · 1 Dataset

TL;DR

This paper introduces a novel language-driven video inpainting task, supported by a new dataset and a diffusion-based framework that leverages multimodal large language models to perform complex, instruction-guided video editing.

Contribution

It presents the first end-to-end framework for language-guided video inpainting and introduces the ROVI dataset for training and evaluation.

Findings

01. The framework effectively understands and executes complex language instructions.

02. The ROVI dataset enables diverse inpainting scenarios.

03. The approach outperforms traditional mask-based methods.

Abstract

We introduce a new task, language-driven video inpainting, which uses natural language instructions to guide the inpainting process. This approach overcomes the limitations of traditional video inpainting methods that depend on manually labeled binary masks, which are often tedious and labor-intensive to produce. We present the Remove Objects from Videos by Instructions (ROVI) dataset, containing 5,650 videos and 9,091 inpainting results, to support training and evaluation for this task. We also propose a novel diffusion-based language-driven video inpainting framework, the first end-to-end baseline for this task, which integrates Multimodal Large Language Models to understand and execute complex language-based inpainting requests effectively. Our comprehensive results showcase the dataset's versatility and the model's effectiveness in various language-instructed inpainting scenarios. We will make…

Code & Models

Repositories

jianzongwu/language-driven-video-inpainting
PyTorch · Official

Models

Datasets

jianzongwu/rovi
dataset · 63 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization

Methods: Inpainting