Multi-modal Segment Assemblage Network for Ad Video Editing with Importance-Coherence Reward
Yolo Yunlong Tang, Siting Xu, Teng Wang, Qin Lin, Qinglin Lu, Feng Zheng

TL;DR
This paper introduces M-SAN, an end-to-end multi-modal network for automatic ad video editing that improves segment assemblage quality by using importance-coherence rewards and a novel evaluation metric.
Contribution
The paper proposes M-SAN, a novel end-to-end model utilizing multi-modal features and a new reward function for improved ad video segment assemblage.
Findings
M-SAN outperforms previous methods on the Imp-Coh@Time metric.
Multi-modal representation enhances segment assemblage quality.
Importance-coherence reward significantly improves training effectiveness.
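The findings above name an importance-coherence reward without defining it. As a rough illustration only, the sketch below scores an ordered selection of segments by mixing two terms: the mean of per-segment importance scores, and the mean cosine similarity between embeddings of adjacent selected segments. The function names, the `alpha` weight, and the use of cosine similarity as a coherence proxy are all assumptions made for this example, not the paper's definition.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def importance_coherence_reward(selected, importance, embeddings, alpha=0.5):
    """Hypothetical reward for an ordered list of selected segment indices.

    importance: per-segment scores (assumed to lie in [0, 1]).
    embeddings: per-segment feature vectors (assumed multi-modal features).
    alpha:      assumed mixing weight between importance and coherence.
    """
    if not selected:
        return 0.0
    # Importance term: average score of the chosen segments.
    imp = sum(importance[i] for i in selected) / len(selected)
    # Coherence term: average similarity of each adjacent pair in the output order.
    pairs = list(zip(selected, selected[1:]))
    if pairs:
        coh = sum(cosine(embeddings[a], embeddings[b]) for a, b in pairs) / len(pairs)
    else:
        coh = 0.0
    return alpha * imp + (1.0 - alpha) * coh
```

Under these assumptions, a selection that keeps high-importance segments but jumps between unrelated content scores lower than one that is both important and smooth, which is the trade-off the reward's name suggests.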
Abstract
Advertisement video editing aims to automatically edit advertising videos into shorter videos while retaining coherent content and the crucial information conveyed by advertisers. It mainly contains two stages: video segmentation and segment assemblage. The existing method performs well at the video segmentation stage but depends on extra cumbersome models and performs poorly at the segment assemblage stage. To address these problems, we propose M-SAN (Multi-modal Segment Assemblage Network), which can perform the segment assemblage task efficiently and coherently, end to end. It utilizes multi-modal representations extracted from the segments and follows the Encoder-Decoder Ptr-Net framework with the Attention mechanism. An importance-coherence reward is designed for training M-SAN. We experiment on the Ads-1k dataset with 1000+ videos under rich ad scenarios collected…
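The abstract describes a pointer-network decoder that selects segments with attention. The following is a minimal sketch of that selection idea, not M-SAN itself: it uses plain dot-product attention with no learned parameters, takes the previously chosen segment's embedding as the next query instead of a recurrent decoder state, and decodes greedily under a duration budget. All names (`pointer_step`, `greedy_assemble`, `target_duration`) are hypothetical, and the input is assumed to be a non-empty list of segment embeddings.

```python
import math


def softmax(scores):
    """Numerically stable softmax; entries of -inf get probability 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pointer_step(query, keys, mask):
    """One pointer step: attention over segments, excluding masked ones.

    Returns a probability distribution over segment indices.
    At least one segment must be unmasked.
    """
    scores = []
    for key, used in zip(keys, mask):
        if used:
            scores.append(float("-inf"))  # already considered: cannot be pointed at
        else:
            scores.append(sum(q * k for q, k in zip(query, key)))
    return softmax(scores)


def greedy_assemble(keys, durations, target_duration):
    """Greedily point at segments until none fits in the duration budget."""
    n = len(keys)
    dim = len(keys[0])
    mask = [False] * n
    # Assumed initial query: the mean segment embedding.
    query = [sum(k[d] for k in keys) / n for d in range(dim)]
    selected, total = [], 0.0
    while not all(mask):
        probs = pointer_step(query, keys, mask)
        candidates = [i for i in range(n) if not mask[i]]
        choice = max(candidates, key=lambda i: probs[i])
        mask[choice] = True
        if total + durations[choice] > target_duration:
            continue  # segment does not fit; skip it and try the next best
        selected.append(choice)
        total += durations[choice]
        query = keys[choice]  # assumed stand-in for a decoder hidden state
    return selected, total
```

In a trained model the attention scores would come from learned encoder and decoder states, and sampling from `probs` rather than taking the argmax would allow a reward signal to be used for policy-gradient training; the greedy loop here only shows the masking and pointing mechanics.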
Taxonomy
Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Visual Attention and Saliency Detection
