Text-to-Edit: Controllable End-to-End Video Ad Creation via Multimodal LLMs
Dabing Cheng, Haosen Zhan, Xingchen Zhao, Guisheng Liu, Zemin Li, Jinghui Xie, Zhao Song, Weiguo Feng, Bingyue Peng

TL;DR
This paper presents a novel end-to-end framework using multimodal large language models for controllable, text-guided video editing, significantly improving efficiency and accuracy in short-video content creation.
Contribution
The paper introduces a text-to-edit mechanism, combined with a denser frame rate and slow-fast processing, to improve video understanding and control over the edited output.
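This summary does not spell out how the denser frame rate and slow-fast processing are combined, so the following is only a minimal sketch of the general slow-fast idea, not the authors' implementation. It assumes a "fast" pathway that samples frames densely and a "slow" pathway that keeps a sparse subset of those frames for more detailed processing. The function name `slow_fast_indices` and the parameters `fast_fps` and `slow_stride` are hypothetical.

```python
def slow_fast_indices(num_frames, video_fps, fast_fps=2.0, slow_stride=4):
    """Pick frame indices for a dense (fast) and a sparse (slow) pathway.

    All names and default values are illustrative assumptions, not taken
    from the paper.
    """
    if num_frames <= 0 or video_fps <= 0 or fast_fps <= 0 or slow_stride < 1:
        raise ValueError("all arguments must be positive")

    # Fast pathway: sample at roughly fast_fps frames per second.
    step = max(1, round(video_fps / fast_fps))
    fast = list(range(0, num_frames, step))

    # Slow pathway: keep every slow_stride-th frame of the fast pathway.
    slow = fast[::slow_stride]
    return fast, slow


# A 10-second clip at 30 fps, sampled at 2 fps with a slow stride of 4.
fast, slow = slow_fast_indices(300, 30.0, fast_fps=2.0, slow_stride=4)
```

In a setup like this, the many fast-pathway frames would typically be encoded cheaply to capture motion over time, while the few slow-pathway frames receive a larger share of the compute to capture spatial detail.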
Findings
Effective on advertising datasets
Generalizes well to public datasets
Enhances video editing quality and controllability
Abstract
The exponential growth of short-video content has created a surge in demand for efficient, automated video editing solutions, with challenges arising from the need to understand videos and tailor the editing to user requirements. Addressing this need, we propose an innovative end-to-end foundational framework that gives precise control over the final edited video content. Leveraging the flexibility and generalizability of Multimodal Large Language Models (MLLMs), we define clear input-output mappings for efficient video creation. To bolster the model's capability in processing and comprehending video content, we introduce a strategic combination of a denser frame rate and a slow-fast processing technique, significantly enhancing the extraction and understanding of both temporal and spatial video information. Furthermore, we introduce a text-to-edit…
Taxonomy
Topics: Natural Language Processing Techniques · Artificial Intelligence in Games · Video Analysis and Summarization
