M3L: Language-based Video Editing via Multi-Modal Multi-Level Transformers
Tsu-Jui Fu, Xin Eric Wang, Scott T. Grafton, Miguel P. Eckstein, William Yang Wang

TL;DR
This paper introduces M3L, a multi-modal transformer model for language-guided video editing that preserves source content while applying semantic changes based on natural language instructions.
Contribution
The paper proposes a novel multi-level transformer architecture for language-based video editing, enabling more accurate and controlled edits guided by natural language instructions.
Findings
M3L effectively performs language-guided video editing.
The paper introduces new datasets for evaluating language-based video editing.
Language-based video editing (LBVE) opens new research directions in vision-and-language tasks.
Abstract
Video editing tools are widely used nowadays for digital design. Although the demand for these tools is high, the prior knowledge required makes it difficult for novices to get started. Systems that could follow natural language instructions to perform automatic editing would significantly improve accessibility. This paper introduces the language-based video editing (LBVE) task, which allows the model to edit, guided by text instruction, a source video into a target video. LBVE contains two features: 1) the scenario of the source video is preserved instead of generating a completely different video; 2) the semantic is presented differently in the target video, and all changes are controlled by the given instruction. We propose a Multi-Modal Multi-Level Transformer (M3L) to carry out LBVE. M3L dynamically learns the correspondence between video perception and language semantic at…
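The abstract says the model learns a correspondence between video perception and language semantics, and the taxonomy lists standard transformer components. As a rough illustration only, the sketch below shows generic scaled dot-product cross-attention in which each video frame feature attends over instruction token features. It is not the authors' architecture: the function names, the toy feature values, and the choice of frames as queries and tokens as keys and values are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(frame_feats, token_feats):
    """Generic scaled dot-product cross-attention (hypothetical helper).

    Each frame feature acts as a query; the instruction token features
    act as both keys and values. Returns (fused, weights), with one
    fused vector and one attention-weight row per frame.
    """
    dim = len(token_feats[0])
    scale = math.sqrt(dim)
    fused, weights = [], []
    for query in frame_feats:
        # Similarity of this frame to every instruction token.
        scores = [sum(a * b for a, b in zip(query, key)) / scale
                  for key in token_feats]
        row = softmax(scores)
        # Weighted sum of token features, one value per dimension.
        fused.append([sum(w * value[d] for w, value in zip(row, token_feats))
                      for d in range(dim)])
        weights.append(row)
    return fused, weights


# Made-up toy features: 2 frames, 3 instruction tokens, dimension 2.
frames = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
fused, weights = cross_attention(frames, tokens)
```

In this toy setup each frame places its largest weight on the token whose feature points in the same direction, which is the kind of frame-to-word correspondence the abstract alludes to. A real model would use learned projections and multiple heads.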
Taxonomy
Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Attention Is All You Need · Dropout · Byte Pair Encoding · Residual Connection · Layer Normalization · Label Smoothing · Adam
