VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models

Zhen Xing; Qi Dai; Zihao Zhang; Hui Zhang; Han Hu; Zuxuan Wu; Yu-Gang Jiang

arXiv:2311.18837·cs.CV·December 1, 2023·2 cites

TL;DR

VIDiff is a unified diffusion-based model capable of understanding and editing videos according to multi-modal instructions, achieving fast and consistent results across various video tasks.

Contribution

First unified diffusion model for diverse video understanding and editing tasks, enabling fast, instruction-driven video translation and enhancement.

Findings

01. Edits and translates videos within seconds, following user instructions.

02. Keeps long-video edits consistent through an iterative auto-regressive method.

03. Produces high-quality, diverse video outputs aligned with user instructions.

Abstract

Diffusion models have achieved significant success in image and video generation. This motivates a growing interest in video editing tasks, where videos are edited according to provided text descriptions. However, most existing approaches only focus on video editing for short clips and rely on time-consuming tuning or inference. We are the first to propose Video Instruction Diffusion (VIDiff), a unified foundation model designed for a wide range of video tasks. These tasks encompass both understanding tasks (such as language-guided video object segmentation) and generative tasks (video editing and enhancement). Our model can edit and translate the desired results within seconds based on user instructions. Moreover, we design an iterative auto-regressive method to ensure consistency in editing and enhancing long videos. We provide convincing generative results for diverse input videos…
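The abstract mentions an iterative auto-regressive method for long videos but does not spell out its details here. One plausible reading is sketched below: the video is split into short clips, and each clip is edited while conditioned on the last few already-edited frames. Everything in this sketch is an assumption made for illustration, not the authors' implementation: the function `edit_long_video`, the parameters `clip_len` and `context_len`, and the `edit_clip` callback (a stand-in for a diffusion sampler) are all hypothetical names.

```python
from typing import Callable, List, Sequence

Frame = str  # stand-in type; a real system would use image tensors


def edit_long_video(
    frames: Sequence[Frame],
    edit_clip: Callable[[List[Frame], List[Frame]], List[Frame]],
    clip_len: int = 8,
    context_len: int = 1,
) -> List[Frame]:
    """Edit a long video clip by clip (hypothetical sketch).

    Each clip is passed to `edit_clip` together with the last
    `context_len` frames that were already edited, so every new clip
    is conditioned on previous output (the auto-regressive part).
    """
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    if context_len < 0:
        raise ValueError("context_len must be non-negative")

    edited: List[Frame] = []
    for start in range(0, len(frames), clip_len):
        clip = list(frames[start:start + clip_len])
        # The first clip has no edited history, so its context is empty.
        context = edited[-context_len:] if context_len else []
        edited.extend(edit_clip(clip, context))
    return edited


def toy_edit(clip: List[Frame], context: List[Frame]) -> List[Frame]:
    """Toy stand-in for a diffusion sampler: upper-cases each frame label."""
    return [frame.upper() for frame in clip]


if __name__ == "__main__":
    video = ["a", "b", "c", "d", "e"]
    print(edit_long_video(video, toy_edit, clip_len=2, context_len=1))
```

In this sketch the conditioning is only passed along as an argument; how a real model would consume the context frames (for example as extra input channels or as fixed frames during sampling) is not stated in the abstract and is left open.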

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Video Analysis and Summarization

Methods: Diffusion · Focus