SNIFR: Boosting Fine-Grained Child Harmful Content Detection Through Audio-Visual Alignment with Cascaded Cross-Transformer

Orchid Chetia Phukan; Mohd Mujtaba Akhtar; Girish; Swarup Ranjan Behera; Abu Osama Siddiqui; Sarthak Jain; Priyabrata Mallick; Jaya Sai Kiran Patibandla; Pailla Balakrishna Reddy; Arun Balaji Buduru; Rajesh Sharma

arXiv:2506.03378 · eess.AS · June 5, 2025

TL;DR

SNIFR is an audio-visual alignment framework built on a cascaded cross-transformer. It improves fine-grained detection of video content that is harmful to children over unimodal and baseline fusion methods, and it targets the evasion tactic of hiding unsafe content in only a few frames.

Contribution

Introduces SNIFR, a new transformer-based framework that effectively aligns audio and visual cues for enhanced harmful content detection in videos.

Findings

01

Achieves state-of-the-art performance in harmful content detection.

02

Outperforms unimodal and baseline fusion methods.

03

Effectively detects minimal unsafe content embedded in videos.

Abstract

As video-sharing platforms have grown over the past decade, child viewership has surged, increasing the need for precise detection of harmful content such as violence or explicit scenes. Malicious users exploit moderation systems by embedding unsafe content in only a few frames to evade detection. While prior research has focused on visual cues and has advanced such fine-grained detection, audio features remain underexplored. In this study, we combine audio cues with visual cues for fine-grained child harmful content detection and introduce SNIFR, a novel framework for effective alignment of the two modalities. SNIFR employs a transformer encoder for intra-modality interaction, followed by a cascaded cross-transformer for inter-modality alignment. Our approach achieves superior performance over unimodal and baseline fusion methods, setting a new state-of-the-art.
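The two-stage flow described in the abstract (intra-modality interaction, then cascaded inter-modality alignment) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the order of the cascade, the pooling, or the layer sizes, so the function names (`attend`, `fuse`), the audio-then-visual ordering, the mean pooling, and the omission of learned projection weights are all assumptions made to keep the example short and dependency-free.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, context):
    # Single-head scaled dot-product attention with a residual connection.
    # Assumption: no learned Q/K/V projections, so tokens act as their own
    # queries, keys, and values. A real transformer layer would learn these.
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, context)) for j in range(d)]
        out.append([qi + mi for qi, mi in zip(q, mixed)])
    return out

def mean_pool(tokens):
    # Average token vectors into one fixed-size vector.
    n = len(tokens)
    return [sum(t[j] for t in tokens) / n for j in range(len(tokens[0]))]

def fuse(audio, visual):
    # Stage 1: intra-modality interaction (self-attention inside each modality).
    a = attend(audio, audio)
    v = attend(visual, visual)
    # Stage 2: cascaded cross-attention (assumed ordering). Audio tokens first
    # query the visual tokens, then visual tokens query the updated audio tokens.
    a_aligned = attend(a, v)
    v_aligned = attend(v, a_aligned)
    # Pool and concatenate into one joint vector for a downstream classifier.
    return mean_pool(a_aligned) + mean_pool(v_aligned)

# Hypothetical toy inputs: 3 audio tokens and 5 visual tokens, 4 dims each.
audio = [[0.1 * (i + j) for j in range(4)] for i in range(3)]
visual = [[0.05 * (i - j) for j in range(4)] for i in range(5)]
joint = fuse(audio, visual)
```

The two modalities may have different sequence lengths, since cross-attention only requires a shared feature dimension. In this sketch the joint vector has twice that dimension and would feed a classification head, which is omitted here.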

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Emotion and Mood Recognition · Multimodal Machine Learning Applications