MultiHateLoc: Towards Temporal Localisation of Multimodal Hate Content in Online Videos

Qiyue Sun; Tailin Chen; Yinghui Zhang; Yuchen Zhang; Jiangbei Yue; Jianbo Jiao; Zeyu Fu

arXiv:2512.10408 · cs.CV · January 30, 2026

Open Access

TL;DR

This paper introduces MultiHateLoc, a novel weakly-supervised framework for localising multimodal hate speech in online videos, effectively capturing temporal and cross-modal dynamics to produce fine-grained, interpretable predictions.

Contribution

MultiHateLoc is the first framework to address weakly-supervised temporal localisation of multimodal hate speech, integrating modality-aware encoders, dynamic fusion, and contrastive alignment.

Findings

01. Achieves state-of-the-art localisation performance on HateMM and MultiHateClip datasets.

02. Effectively models heterogeneous temporal patterns across modalities.

03. Produces fine-grained, interpretable frame-level predictions.

Abstract

The rapid growth of video content on platforms such as TikTok and YouTube has intensified the spread of multimodal hate speech, where harmful cues emerge subtly and asynchronously across visual, acoustic, and textual streams. Existing research primarily focuses on video-level classification, leaving the practically crucial task of temporal localisation, identifying when hateful segments occur, largely unaddressed. This challenge is even more noticeable under weak supervision, where only video-level labels are available, and static fusion or classification-based architectures struggle to capture cross-modal and temporal dynamics. To address these challenges, we propose MultiHateLoc, the first framework designed for weakly-supervised multimodal hate localisation. MultiHateLoc incorporates (1) modality-aware temporal encoders to model heterogeneous sequential patterns, including a tailored…
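To make the weakly-supervised setting concrete, the sketch below shows one common generic recipe for temporal localisation from video-level labels: per-segment scores are pooled into a single video-level score for training (here, a top-k mean), and at inference the same per-segment scores are thresholded and merged into time intervals. This is an illustrative assumption, not the MultiHateLoc method, whose details are not given in the truncated abstract. The function names, the top-k pooling choice, the threshold, and the segment length are all hypothetical.

```python
from typing import List, Tuple


def video_score_topk(segment_scores: List[float], k: int) -> float:
    """Pool per-segment scores into one video-level score (mean of the top k).

    Assumption: top-k mean pooling, a common multiple-instance choice,
    not something the source attributes to MultiHateLoc.
    """
    if not segment_scores:
        raise ValueError("segment_scores must be non-empty")
    k = max(1, min(k, len(segment_scores)))
    top = sorted(segment_scores, reverse=True)[:k]
    return sum(top) / k


def localise(segment_scores: List[float], threshold: float,
             segment_seconds: float) -> List[Tuple[float, float]]:
    """Merge consecutive above-threshold segments into (start, end) intervals in seconds."""
    intervals: List[Tuple[float, float]] = []
    start = None  # index of the first segment in the currently open interval
    for i, score in enumerate(segment_scores):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            intervals.append((start * segment_seconds, i * segment_seconds))
            start = None
    if start is not None:  # close an interval that runs to the end of the video
        intervals.append((start * segment_seconds,
                          len(segment_scores) * segment_seconds))
    return intervals


# Hypothetical per-segment hate scores for a 6-second video, one score per second.
scores = [0.1, 0.2, 0.9, 0.8, 0.1, 0.7]
print(video_score_topk(scores, k=2))
print(localise(scores, threshold=0.5, segment_seconds=1.0))
```

In this toy setup only the pooled score would be compared against the video-level label during training, while the per-segment scores are what yield the localisation at inference.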

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Emotion and Mood Recognition · Generative Adversarial Networks and Image Synthesis