A Multimodal Framework for Video Ads Understanding

Zejia Weng; Lingchen Meng; Rui Wang; Zuxuan Wu; Yu-Gang Jiang

arXiv:2108.12868·cs.CV·August 31, 2021

Open Access

TL;DR

This paper presents a multimodal framework for understanding video advertisements by combining scene segmentation and multi-modal tagging, utilizing visual, OCR, and ASR features to improve content analysis accuracy.

Contribution

The authors introduce a novel multimodal system that integrates visual, textual, and speech features for structured analysis of advertising videos, advancing automatic ad understanding.

Findings

01. Achieved a score of 0.2470 on the TAAC benchmark.
02. Ranked fourth in the 2021 TAAC competition.
03. Demonstrated effectiveness of combining visual, OCR, and ASR features.

Abstract

There is a growing trend in placing video advertisements on social platforms for online marketing, which demands automatic approaches to understand the contents of advertisements effectively. Taking the 2021 TAAC competition as an opportunity, we developed a multimodal system to improve the ability of structured analysis of advertising video content. In our framework, we break down the video structuring analysis problem into two tasks, i.e., scene segmentation and multi-modal tagging. In scene segmentation, we build upon a temporal convolution module for temporal modeling to predict whether adjacent frames belong to the same scene. In multi-modal tagging, we first compute clip-level visual features by aggregating frame-level features with NeXt-SoftDBoF. The visual features are further complemented with textual features that are derived using a global-local attention mechanism to extract…
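The scene segmentation step described in the abstract (temporal modeling over frames, followed by a decision on whether adjacent frames belong to the same scene) can be illustrated with a toy sketch. This is not the authors' model: the function names `temporal_conv` and `boundary_probs`, the fixed smoothing kernel, and the `scale` and `bias` values are all hypothetical and hand-set here, whereas the paper's module is learned from data. The sketch only shows the shape of the computation, assuming per-frame feature vectors are already available.

```python
import math

def temporal_conv(frames, kernel):
    """1-D convolution along the time axis.

    Each kernel tap is a scalar weight applied to a whole feature vector.
    Indices past either end are clamped to the nearest valid frame.
    (Hypothetical helper; the kernel is fixed here, not learned.)
    """
    n, dim, half = len(frames), len(frames[0]), len(kernel) // 2
    out = []
    for t in range(n):
        acc = [0.0] * dim
        for j, w in enumerate(kernel):
            s = min(max(t + j - half, 0), n - 1)  # clamp at the edges
            for d in range(dim):
                acc[d] += w * frames[s][d]
        out.append(acc)
    return out

def boundary_probs(frames, kernel=(0.25, 0.5, 0.25), scale=4.0, bias=-2.0):
    """Score each adjacent pair (t, t+1) with a boundary probability.

    The score is a sigmoid of the distance between temporally smoothed
    features. scale and bias are illustrative assumptions, not values
    taken from the paper.
    """
    ctx = temporal_conv(frames, kernel)
    probs = []
    for a, b in zip(ctx, ctx[1:]):
        dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
        probs.append(1.0 / (1.0 + math.exp(-(scale * dist + bias))))
    return probs

# Toy input: four frames of one "scene" followed by four frames of another.
frames = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
probs = boundary_probs(frames)
best = probs.index(max(probs))
print(f"most likely boundary is between frames {best} and {best + 1}")
```

On the toy input the highest score falls on the pair straddling the change of scene, and pairs inside a scene score below 0.5. A learned model would replace the fixed kernel and the distance-based score with trained parameters.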

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques

Methods: Global-Local Attention · Convolution