Video Ads Content Structuring by Combining Scene Confidence Prediction and Tagging
Tomoyuki Suzuki, Antonio Tejero-de-Pablos

TL;DR
This paper introduces a two-stage method for structuring video ads: it first detects scene boundaries, then tags each scene using multimodal data (visual, audio, text), improving on previous baselines on the challenging Tencent Advertisement Video dataset.
Contribution
The paper presents a novel two-stage approach combining scene boundary detection with confidence scoring and multimodal scene tagging for video ads.
Findings
Improved segmentation accuracy over baselines
Effective use of multimodal data for scene tagging
Improved performance over previous baselines on the Tencent Advertisement Video dataset
Abstract
Video ads segmentation and tagging is a challenging task for two main reasons: (1) the video scene structure is complex and (2) it includes multiple modalities (e.g., visual, audio, text). While previous work focuses mostly on activity videos (e.g., "cooking", "sports"), it is not clear how those methods can be leveraged to tackle the task of video ads content structuring. In this paper, we propose a two-stage method that first provides the boundaries of the scenes, and then combines a confidence score for each segmented scene with the tag classes predicted for that scene. We provide extensive experimental results on the network architectures and modalities used for the proposed method. Our combined method improves on the previous baselines on the challenging "Tencent Advertisement Video" dataset.
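To make the two-stage flow in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the scene confidence and the tag predictions are combined, so the multiplicative rule below is an assumption, as are the names `Scene`, `boundaries_to_scenes`, and `combine`, and the example tag labels. The boundary timestamps, confidences, and tag probabilities would come from trained models, which are not shown here.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    start: float       # scene start time in seconds
    end: float         # scene end time in seconds
    confidence: float  # assumed scene confidence in [0, 1]
    tag_probs: dict    # tag name -> predicted probability (hypothetical tags)


def boundaries_to_scenes(boundaries, duration):
    """Stage 1 output: turn boundary timestamps into (start, end) segments."""
    inner = sorted(b for b in boundaries if 0.0 < b < duration)
    cuts = [0.0] + inner + [duration]
    return list(zip(cuts[:-1], cuts[1:]))


def combine(scene, top_k=2):
    """Stage 2: score each tag as confidence * probability (assumed rule)."""
    scored = {tag: scene.confidence * p for tag, p in scene.tag_probs.items()}
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    segments = boundaries_to_scenes([9.5, 4.0], duration=15.0)
    first = Scene(*segments[0], confidence=0.5,
                  tag_probs={"product": 0.8, "indoor": 0.2, "speech": 0.6})
    print(segments)
    print(combine(first))
```

With a multiplicative rule, tags from a scene whose boundaries are uncertain are ranked lower than equally probable tags from a confident scene. Other combination rules would fit the abstract's description equally well.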
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
