A Multimodal Framework for Video Ads Understanding
Zejia Weng, Lingchen Meng, Rui Wang, Zuxuan Wu, Yu-Gang Jiang

TL;DR
This paper presents a multimodal framework for understanding video advertisements. It combines scene segmentation with multi-modal tagging and uses visual, OCR, and ASR features to improve content analysis.
Contribution
The authors introduce a novel multimodal system that integrates visual, textual, and speech features for structured analysis of advertising videos, advancing automatic ad understanding.
Findings
Achieved a score of 0.2470 on the TAAC benchmark.
Ranked fourth in the 2021 TAAC competition.
Demonstrated effectiveness of combining visual, OCR, and ASR features.
Abstract
There is a growing trend in placing video advertisements on social platforms for online marketing, which demands automatic approaches to understand the contents of advertisements effectively. Taking the 2021 TAAC competition as an opportunity, we developed a multimodal system to improve the ability of structured analysis of advertising video content. In our framework, we break down the video structuring analysis problem into two tasks, i.e., scene segmentation and multi-modal tagging. In scene segmentation, we build upon a temporal convolution module for temporal modeling to predict whether adjacent frames belong to the same scene. In multi-modal tagging, we first compute clip-level visual features by aggregating frame-level features with NeXt-SoftDBoF. The visual features are further complemented with textual features that are derived using a global-local attention mechanism to extract…
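As an illustration of the scene segmentation step described in the abstract, the sketch below shows one plausible way to turn per-pair "same scene" predictions into scene segments. It is not the authors' implementation: the temporal convolution model that would produce the probabilities is not shown, and the function name `boundaries_to_scenes`, the `same_scene_prob` input, and the 0.5 threshold are all assumptions made for this example.

```python
from typing import List, Tuple


def boundaries_to_scenes(
    same_scene_prob: List[float], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Group frames into scenes from adjacent-frame predictions.

    same_scene_prob[i] is a hypothetical model output: the predicted
    probability that frame i and frame i + 1 belong to the same scene.
    Returns inclusive (start, end) frame index ranges, one per scene.
    """
    num_frames = len(same_scene_prob) + 1
    scenes: List[Tuple[int, int]] = []
    start = 0
    for i, prob in enumerate(same_scene_prob):
        # A low probability marks a scene boundary between frame i and i + 1.
        if prob < threshold:
            scenes.append((start, i))
            start = i + 1
    # Close the final scene, which always ends at the last frame.
    scenes.append((start, num_frames - 1))
    return scenes


if __name__ == "__main__":
    # Six frames, five adjacent pairs; the values are made up for illustration.
    print(boundaries_to_scenes([0.9, 0.8, 0.1, 0.95, 0.2]))
```

A fixed threshold is the simplest choice here; in practice the cut-off would need to be tuned against labelled scene boundaries.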
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Global-Local Attention · Convolution
