Enhanced Multimodal Content Moderation of Children's Videos using Audiovisual Fusion

Syed Hammad Ahmed; Muhammad Junaid Khan; Gita Sukthankar

arXiv:2405.06128·cs.CV·May 13, 2024

TL;DR

This paper proposes an audiovisual fusion approach based on CLIP to improve content moderation of children's videos by incorporating audio cues alongside visual features, addressing limitations of unimodal systems.

Contribution

It introduces an efficient multimodal adaptation of CLIP that leverages audio and prompt learning for better detection of inappropriate content in children's videos.

Findings

01. Enhanced detection accuracy with audiovisual fusion
02. Effective in supervised and few-shot learning scenarios
03. Addresses gaps in unimodal content moderation systems

Abstract

Due to the rise in video content creation targeted towards children, there is a need for robust content moderation schemes for video hosting platforms. A video that is visually benign may include audio content that is inappropriate for young children while being impossible to detect with a unimodal content moderation system. Popular video hosting platforms for children such as YouTube Kids still publish videos which contain audio content that is not conducive to a child's healthy behavioral and physical development. A robust classification of malicious videos requires audio representations in addition to video features. However, recent content moderation approaches rarely employ multimodal architectures that explicitly consider non-speech audio cues. To address this, we present an efficient adaptation of CLIP (Contrastive Language-Image Pre-training) that can leverage contextual audio…
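To make the motivating case in the abstract concrete (a visually benign video with inappropriate audio), the following is a minimal sketch of CLIP-style classification over a fused audiovisual embedding. It is not the authors' architecture: the weighted-average fusion rule, the function names `fuse` and `classify`, the temperature value, and the toy 3-dimensional vectors are all assumptions made purely for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def fuse(visual, audio, audio_weight=0.5):
    """Hypothetical fusion rule: weighted average of unit-normalized embeddings."""
    if len(visual) != len(audio):
        raise ValueError("embeddings must share a dimension")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    v, a = l2_normalize(visual), l2_normalize(audio)
    mixed = [(1.0 - audio_weight) * x + audio_weight * y for x, y in zip(v, a)]
    return l2_normalize(mixed)

def classify(fused, class_embeddings, temperature=0.07):
    """Softmax over cosine similarities to per-class text embeddings."""
    names = list(class_embeddings)
    logits = []
    for name in names:
        text = l2_normalize(class_embeddings[name])
        cosine = sum(f * t for f, t in zip(fused, text))
        logits.append(cosine / temperature)  # temperature is an assumed value
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return {name: e / total for name, e in zip(names, exps)}

# Toy stand-ins for encoder outputs (made up, not real CLIP features).
visual_embedding = [1.0, 0.0, 0.0]  # frames look benign
audio_embedding = [0.0, 1.0, 0.0]   # audio carries the problematic cue
class_text_embeddings = {
    "appropriate": [1.0, 0.0, 0.0],
    "inappropriate": [0.6, 0.8, 0.0],
}

visual_only = classify(fuse(visual_embedding, audio_embedding, audio_weight=0.0),
                       class_text_embeddings)
audiovisual = classify(fuse(visual_embedding, audio_embedding, audio_weight=0.5),
                       class_text_embeddings)
```

With these toy vectors, the visual-only score favors "appropriate", while the fused score favors "inappropriate". This is the failure mode of unimodal moderation that the abstract describes. A real system would replace the toy vectors with outputs from trained visual, audio, and text encoders.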

Code & Models

Repositories

syedhammadahmed/mmob · Official

Taxonomy

Topics: Video Analysis and Summarization · Speech and Audio Processing · Video Surveillance and Tracking Methods

Methods: Contrastive Language-Image Pre-training