TL;DR
This paper introduces CMFusion, a multimodal hate video detection model that integrates text, audio, and video features through channel-wise and modality-wise fusion and improves detection accuracy over baseline models.
Contribution
The paper proposes a new fusion mechanism for multimodal hate video detection that captures temporal and modality interactions more effectively than existing methods.
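To make the terms concrete, the following is a minimal sketch of what temporal cross-attention followed by channel-wise and modality-wise weighting can look like. It is an illustration under stated assumptions, not the paper's architecture: this summary does not specify the layers, so every function name, the sigmoid gate, the softmax modality weights, and the toy feature values are hypothetical, and a real model would use learned parameters.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys_values):
    # Scaled dot-product attention: each query frame attends over all
    # frames of another modality (no learned projections in this sketch).
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        w = softmax(scores)
        out.append([sum(wj * kv[c] for wj, kv in zip(w, keys_values))
                    for c in range(d)])
    return out

def mean_pool(seq):
    # Average a sequence of frames into a single feature vector.
    return [sum(frame[c] for frame in seq) / len(seq)
            for c in range(len(seq[0]))]

def channel_gate(vec):
    # Channel-wise reweighting: scale each channel by a sigmoid gate.
    # Assumption: the gate is computed from the value itself, not learned.
    return [v / (1.0 + math.exp(-v)) for v in vec]

def modality_fuse(vectors):
    # Modality-wise weighting: softmax over each modality's mean activation,
    # then a weighted sum of the modality vectors.
    weights = softmax([sum(v) / len(v) for v in vectors])
    d = len(vectors[0])
    fused = [sum(w * v[c] for w, v in zip(weights, vectors)) for c in range(d)]
    return fused, weights

# Toy features (hypothetical values): 3 video frames, 2 audio frames, 1 text vector.
video = [[0.1, 0.2, 0.3, 0.4], [0.0, 0.5, 0.1, 0.2], [0.3, 0.3, 0.3, 0.3]]
audio = [[0.2, 0.1, 0.0, 0.4], [0.5, 0.5, 0.1, 0.0]]
text = [0.4, 0.1, 0.2, 0.3]

video_ctx = cross_attention(video, audio)   # video frames attend to audio
audio_ctx = cross_attention(audio, video)   # audio frames attend to video

per_modality = [
    channel_gate(mean_pool(video_ctx)),
    channel_gate(mean_pool(audio_ctx)),
    channel_gate(text),
]
fused, weights = modality_fuse(per_modality)
```

In this sketch the fused vector would feed a classifier head. The cross-attention step handles temporal interaction between modalities, while the two weighting steps act on feature channels and on whole modalities respectively.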
Findings
CMFusion outperforms five baseline models in accuracy, precision, recall, and F1 score.
Ablation studies confirm the effectiveness of the fusion modules and temporal cross-attention.
The model demonstrates robustness across different parameter settings.
Abstract
The rapid rise of video content on platforms such as TikTok and YouTube has transformed information dissemination, but it has also facilitated the spread of harmful content, particularly hate videos. Despite significant efforts to combat hate speech, detecting these videos remains challenging due to their often implicit nature. Current detection methods primarily rely on unimodal approaches, which inadequately capture the complementary features across different modalities. While multimodal techniques offer a broader perspective, many fail to effectively integrate temporal dynamics and modality-wise interactions essential for identifying nuanced hate content. In this paper, we present CMFusion, an enhanced multimodal hate video detection model utilizing a novel Channel-wise and Modality-wise Fusion Mechanism. CMFusion first extracts features from text, audio, and video modalities using…
