Towards Training-free Multimodal Hate Localisation with Large Language Models
Yueming Sun, Long Yang, Jianbo Jiao, Zeyu Fu

TL;DR
This paper introduces LELA, a training-free framework that uses large language models and modality-specific captioning to detect and temporally localize hateful content in videos; it outperforms training-free baselines on the HateMM and MultiHateClip benchmarks.
Contribution
LELA is the first training-free multimodal hate localization framework leveraging LLMs and modality-specific captioning for fine-grained temporal detection.
Findings
LELA outperforms all training-free baselines on HateMM and MultiHateClip benchmarks.
The method decomposes each video into five modalities (image, speech, OCR, music, and video context) and scores each frame for hateful content.
Extensive ablations validate the robustness and interpretability of LELA.
Abstract
The proliferation of hateful content in online videos poses severe threats to individual well-being and societal harmony. However, existing solutions for video hate detection either rely heavily on large-scale human annotations or lack fine-grained temporal precision. In this work, we propose LELA, the first training-free Large Language Model (LLM) based framework for hate video localization. Distinct from state-of-the-art models that depend on supervised pipelines, LELA leverages LLMs and modality-specific captioning to detect and temporally localize hateful content in a training-free manner. Our method decomposes a video into five modalities, including image, speech, OCR, music, and video context, and uses a multi-stage prompting scheme to compute fine-grained hateful scores for each frame. We further introduce a composition matching mechanism to enhance cross-modal reasoning.…
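To make the localization step concrete, here is a minimal sketch of how per-frame hateful scores could be turned into time segments. It is an illustration under stated assumptions, not the authors' implementation: the scores are assumed to have already been produced by the LLM prompting stage, and the max-based fusion rule, the 0.5 threshold, the 1 frame-per-second sampling rate, and the names `fuse_scores` and `localise` are all hypothetical choices that do not come from the paper.

```python
from typing import Dict, List, Tuple


def fuse_scores(per_modality: Dict[str, List[float]]) -> List[float]:
    """Combine per-frame scores across modalities.

    Assumption: the fused score of a frame is the maximum over modalities.
    The paper's actual combination rule is not given in the abstract.
    """
    n_frames = len(next(iter(per_modality.values())))
    return [
        max(scores[i] for scores in per_modality.values())
        for i in range(n_frames)
    ]


def localise(
    scores: List[float], threshold: float = 0.5, fps: float = 1.0
) -> List[Tuple[float, float]]:
    """Merge consecutive frames scoring at or above `threshold` into
    (start_seconds, end_seconds) segments.

    `threshold` and `fps` are hypothetical parameters for this sketch.
    """
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame in the currently open segment
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i  # open a new segment
        elif score < threshold and start is not None:
            segments.append((start / fps, i / fps))  # close the segment
            start = None
    if start is not None:
        # The video ended while a segment was still open.
        segments.append((start / fps, len(scores) / fps))
    return segments


# Toy input: two of the five modalities, four frames sampled at 1 fps.
example = {
    "image": [0.1, 0.2, 0.1, 0.1],
    "speech": [0.0, 0.7, 0.8, 0.2],
}
fused = fuse_scores(example)
segments = localise(fused)
```

On the toy input, frames 1 and 2 exceed the threshold through the speech scores, so the sketch reports a single segment covering seconds 1 to 3. A real pipeline would replace the hard-coded scores with the output of the prompting stage and would likely tune the threshold per dataset.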
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Generative Adversarial Networks and Image Synthesis · Emotion and Mood Recognition
