Multi-modal Segment Assemblage Network for Ad Video Editing with Importance-Coherence Reward

Yolo Yunlong Tang; Siting Xu; Teng Wang; Qin Lin; Qinglin Lu; Feng Zheng

arXiv:2209.12164·cs.CV·October 9, 2025

TL;DR

This paper introduces M-SAN, an end-to-end multi-modal network for automatic ad video editing that improves segment assemblage quality through an importance-coherence reward, and it proposes a novel evaluation metric.

Contribution

The paper proposes M-SAN, a novel end-to-end model utilizing multi-modal features and a new reward function for improved ad video segment assemblage.

Findings

1. M-SAN outperforms previous methods on the Imp-Coh@Time metric.
2. Multi-modal representation enhances segment assemblage quality.
3. Importance-coherence reward significantly improves training effectiveness.

Abstract

Advertisement video editing aims to automatically edit advertising videos into shorter versions while retaining coherent content and the crucial information conveyed by advertisers. It mainly consists of two stages: video segmentation and segment assemblage. The existing method performs well at the video segmentation stage but depends on extra cumbersome models and performs poorly at the segment assemblage stage. To address these problems, we propose M-SAN (Multi-modal Segment Assemblage Network), which performs the segment assemblage task efficiently and coherently in an end-to-end manner. It uses multi-modal representations extracted from the segments and follows the encoder-decoder Ptr-Net framework with an attention mechanism. An importance-coherence reward is designed for training M-SAN. We experiment on the Ads-1k dataset with 1000+ videos under rich ad scenarios collected…
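To make the segment assemblage idea more concrete, here is a minimal sketch of pointer-style selection under a time budget, paired with a simple importance-plus-coherence score. It is not the authors' implementation. The dot-product attention, the greedy choice, the mean-feature start query, and the exact form of the reward are all assumptions made for illustration, and the function names `pointer_decode` and `importance_coherence_reward` are hypothetical. M-SAN itself is a trained encoder-decoder network, whereas this sketch uses fixed feature vectors.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def pointer_decode(features: List[List[float]],
                   durations: List[float],
                   budget: float) -> List[int]:
    """Greedy pointer-style decoding (illustrative assumption).

    At each step, attention scores are computed only over segments that
    are unused and still fit in the remaining time; the decoder then
    "points" to the most probable one.
    """
    n, dim = len(features), len(features[0])
    # Hypothetical start query: the mean of all segment features.
    query = [sum(f[d] for f in features) / n for d in range(dim)]
    selected: List[int] = []
    remaining = budget
    while True:
        candidates = [i for i in range(n)
                      if i not in selected and durations[i] <= remaining]
        if not candidates:
            break
        probs = softmax([dot(query, features[i]) for i in candidates])
        best = candidates[max(range(len(candidates)), key=probs.__getitem__)]
        selected.append(best)
        remaining -= durations[best]
        query = features[best]  # the next step attends from the last pick
    return selected


def importance_coherence_reward(selected: List[int],
                                importance: List[float],
                                coherence: List[List[float]]) -> float:
    """Assumed reward: mean importance of the chosen segments plus mean
    coherence between consecutive chosen segments."""
    if not selected:
        return 0.0
    imp = sum(importance[i] for i in selected) / len(selected)
    pairs = list(zip(selected, selected[1:]))
    coh = sum(coherence[a][b] for a, b in pairs) / len(pairs) if pairs else 0.0
    return imp + coh
```

In a reinforcement-learning setup, a score of this kind would be computed on a sampled sequence and used as the training signal for the selection policy. The sketch only evaluates the score and does not perform any training.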

Code & Models

Repositories

yunlong10/ads-1k (PyTorch, official)

Taxonomy

Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Visual Attention and Saliency Detection