AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation
Milton Zhou, Sizhong Qin, Yongzhi Li, Quan Chen, Peng Jiang

TL;DR
AutoCut is an end-to-end framework that leverages multimodal discretization and controllable generation to streamline advertisement video editing, reducing costs and improving consistency.
Contribution
AutoCut introduces a unified multimodal token space and a multimodal large language model that together support comprehensive, controllable advertisement video editing within a single pipeline.
Findings
Reduces production cost and iteration time.
Enhances consistency and controllability of videos.
Supports diverse editing tasks like script and music selection.
Abstract
Short-form videos have become a primary medium for digital advertising, requiring scalable and efficient content creation. However, current workflows and AI tools remain disjoint and modality-specific, leading to high production costs and low overall efficiency. To address this issue, we propose AutoCut, an end-to-end advertisement video editing framework based on multimodal discretization and controllable editing. AutoCut employs dedicated encoders to extract video and audio features, then applies residual vector quantization to discretize them into unified tokens aligned with textual representations, constructing a shared video-audio-text token space. Built upon a foundation model, we further develop a multimodal large language model for video editing through combined multimodal alignment and supervised fine-tuning, supporting tasks covering video selection and ordering, script…
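The residual vector quantization step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the codebooks here are random rather than learned, and the names `rvq_encode`, `rvq_decode`, `make_codebooks`, and all sizes are hypothetical choices made for illustration. It shows only the general technique, in which each stage quantizes the residual left by the previous stage, so one feature vector becomes a short sequence of discrete token indices.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((c - v) ** 2 for c, v in zip(code, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Subtract the chosen entry so the next stage sees what is left over.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct a vector by summing the selected entry from each stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def make_codebooks(num_stages, codebook_size, dim, seed=0):
    """Random codebooks for illustration only (assumption: real ones are learned)."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0 / (2 ** stage)) for _ in range(dim)]
         for _ in range(codebook_size)]
        for stage in range(num_stages)
    ]


# Hypothetical sizes: 3 stages, 16 entries per stage, 4-dimensional features.
codebooks = make_codebooks(num_stages=3, codebook_size=16, dim=4)
feature = [0.5, -1.2, 0.3, 0.8]   # stand-in for an encoder output
tokens = rvq_encode(feature, codebooks)
recon = rvq_decode(tokens, codebooks)
print(tokens, recon)
```

In a setup like the one the abstract describes, the per-stage indices would presumably be mapped into a vocabulary shared with text tokens; that mapping is not specified in the text above and is left out of this sketch.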
