M3L: Language-based Video Editing via Multi-Modal Multi-Level Transformers

Tsu-Jui Fu; Xin Eric Wang; Scott T. Grafton; Miguel P. Eckstein; William Yang Wang

arXiv:2104.01122·cs.CV·March 22, 2022

Open Access

TL;DR

This paper introduces M3L, a multi-modal transformer model for language-guided video editing that preserves source content while applying semantic changes based on natural language instructions.

Contribution

The paper proposes a novel multi-level transformer architecture for language-based video editing, enabling more accurate and controlled edits guided by natural language instructions.

Findings

01. M3L effectively performs language-guided video editing.

02. The paper introduces new datasets for evaluating language-based video editing.

03. LBVE opens new research directions in vision-and-language tasks.

Abstract

Video editing tools are widely used nowadays for digital design. Although the demand for these tools is high, the prior knowledge required makes it difficult for novices to get started. Systems that could follow natural language instructions to perform automatic editing would significantly improve accessibility. This paper introduces the language-based video editing (LBVE) task, which allows the model to edit, guided by text instruction, a source video into a target video. LBVE contains two features: 1) the scenario of the source video is preserved instead of generating a completely different video; 2) the semantic is presented differently in the target video, and all changes are controlled by the given instruction. We propose a Multi-Modal Multi-Level Transformer (M$^{3}$L) to carry out LBVE. M$^{3}$L dynamically learns the correspondence between video perception and language semantic at…

Taxonomy

Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Attention Is All You Need · Dropout · Byte Pair Encoding · Residual Connection · Layer Normalization · Label Smoothing · Adam