TL;DR
This paper introduces ImpliHateVid, a large-scale video dataset for implicit hate speech detection, and proposes a two-stage contrastive learning framework leveraging multimodal features to improve detection accuracy.
Contribution
The work presents one of the first large-scale datasets for implicit hate in videos and a novel two-stage contrastive learning approach for multimodal hate speech detection.
Findings
The framework detects implicit hate speech in videos effectively
Multimodal contrastive learning improves detection accuracy
The proposed method outperforms existing approaches
Abstract
Existing research has primarily focused on text- and image-based hate speech detection, while video-based approaches remain underexplored. In this work, we introduce a novel dataset, ImpliHateVid, specifically curated for implicit hate speech detection in videos. ImpliHateVid consists of 2,009 videos comprising 509 implicit hate videos, 500 explicit hate videos, and 1,000 non-hate videos, making it one of the first large-scale video datasets dedicated to implicit hate detection. We also propose a novel two-stage contrastive learning framework for hate speech detection in videos. In the first stage, we train modality-specific encoders for audio, text, and image with a contrastive loss computed on the concatenated features from the three encoders. In the second stage, we train cross-encoders using contrastive learning to refine multimodal representations. Additionally, we incorporate sentiment, emotion,…
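The first stage described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact form of the contrastive loss, so the code below assumes a supervised contrastive (SupCon-style) loss over L2-normalized, concatenated audio, text, and image features. The function names (`fuse`, `supervised_contrastive_loss`), the temperature value, and the two-dimensional toy features are all hypothetical, and the second-stage cross-encoders are not shown.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def fuse(audio, text, image):
    """Concatenate per-modality features into one unit-length vector.
    Assumption: the real encoders would produce these features."""
    return l2_normalize(list(audio) + list(text) + list(image))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """SupCon-style loss (an assumed choice, not confirmed by the abstract):
    same-label samples are pulled together, all others pushed apart."""
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-label partner contributes nothing
        sims = {j: dot(embeddings[i], embeddings[j]) / temperature
                for j in range(n) if j != i}
        # Numerically stable log-sum-exp over every non-anchor sample.
        peak = max(sims.values())
        log_denom = peak + math.log(sum(math.exp(s - peak) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

# Toy batch: two fused "hate" samples and two fused "non-hate" samples.
hate = fuse([1.0, 0.0], [1.0, 0.0], [1.0, 0.0])
non_hate = fuse([0.0, 1.0], [0.0, 1.0], [0.0, 1.0])
batch = [hate, hate, non_hate, non_hate]

loss_aligned = supervised_contrastive_loss(batch, [0, 0, 1, 1])
loss_shuffled = supervised_contrastive_loss(batch, [0, 1, 0, 1])
print(loss_aligned < loss_shuffled)
```

In this toy setting the loss is small when labels agree with the clusters in feature space and large when they do not, which is the signal a contrastive objective of this kind would use to shape the encoders.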