Enhanced Multimodal Content Moderation of Children's Videos using Audiovisual Fusion
Syed Hammad Ahmed, Muhammad Junaid Khan, Gita Sukthankar

TL;DR
This paper proposes an audiovisual fusion approach based on CLIP to improve content moderation of children's videos by incorporating audio cues alongside visual features, addressing limitations of unimodal systems.
Contribution
It introduces an efficient multimodal adaptation of CLIP that leverages audio and prompt learning for better detection of inappropriate content in children's videos.
Findings
Enhanced detection accuracy with audiovisual fusion
Effective in supervised and few-shot learning scenarios
Addresses gaps in unimodal content moderation systems
Abstract
Due to the rise in video content creation targeted towards children, there is a need for robust content moderation schemes for video hosting platforms. A video that is visually benign may include audio content that is inappropriate for young children while being impossible to detect with a unimodal content moderation system. Popular video hosting platforms for children such as YouTube Kids still publish videos which contain audio content that is not conducive to a child's healthy behavioral and physical development. A robust classification of malicious videos requires audio representations in addition to video features. However, recent content moderation approaches rarely employ multimodal architectures that explicitly consider non-speech audio cues. To address this, we present an efficient adaptation of CLIP (Contrastive Language-Image Pre-training) that can leverage contextual audio…
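The abstract's central argument, that a visually benign video can still carry inappropriate audio, can be illustrated with a toy sketch. This is not the paper's architecture: the weighted-average fusion, the `audio_weight` parameter, the temperature value, and the 3-dimensional toy embeddings are all assumptions made for illustration, and the function names are hypothetical. It only shows how a CLIP-style classifier, which scores an embedding against text prompt embeddings by cosine similarity, might change its decision once an audio embedding is mixed into the visual one.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0 else [x / norm for x in v]

def fuse(visual, audio, audio_weight=0.5):
    """Assumed fusion rule: weighted average of unit-norm visual and audio embeddings."""
    v = l2_normalize(visual)
    a = l2_normalize(audio)
    mixed = [(1 - audio_weight) * x + audio_weight * y for x, y in zip(v, a)]
    return l2_normalize(mixed)

def classify(embedding, prompt_embeddings, temperature=0.07):
    """CLIP-style scoring: softmax over cosine similarities to each class prompt."""
    logits = {
        label: sum(e * p for e, p in zip(embedding, l2_normalize(prompt))) / temperature
        for label, prompt in prompt_embeddings.items()
    }
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(z - top) for label, z in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Toy embeddings (illustrative values only, not produced by any real encoder).
visual = [1.0, 0.0, 0.0]            # frames look benign
audio = [0.0, 1.0, 0.0]             # soundtrack carries the problematic cue
prompts = {
    "benign": [1.0, 0.0, 0.0],
    "inappropriate": [0.0, 1.0, 0.2],
}

visual_only = classify(fuse(visual, audio, audio_weight=0.0), prompts)
audiovisual = classify(fuse(visual, audio, audio_weight=0.8), prompts)

print("visual only:", max(visual_only, key=visual_only.get))
print("audiovisual:", max(audiovisual, key=audiovisual.get))
```

With `audio_weight=0.0` the sketch reduces to a unimodal visual classifier, so comparing the two calls shows the gap a unimodal system leaves. In a real system the embeddings would come from trained visual, audio, and text encoders rather than hand-written vectors.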
Taxonomy
Topics: Video Analysis and Summarization · Speech and Audio Processing · Video Surveillance and Tracking Methods
Methods: Contrastive Language-Image Pre-training
