# MM-HSD: Multi-Modal Hate Speech Detection in Videos

**Authors:** Berta Céspedes-Sarrias, Carlos Collado-Capell, Pablo Rodenas-Ruiz, Olena Hrynenko, Andrea Cavallaro

arXiv: 2508.20546 · 2025-08-29

## TL;DR

This paper introduces MM-HSD, a multi-modal model for hate speech detection in videos. It combines video frames, audio, and text (speech transcripts and on-screen text) with features extracted by Cross-Modal Attention, and reaches an M-F1 score of 0.874 on the HateMM dataset, outperforming prior state-of-the-art methods.

## Contribution

The authors report being the first to use Cross-Modal Attention (CMA) as an early feature extractor for hate speech detection in videos, to systematically compare query/key configurations, and to evaluate how the modalities interact within the CMA block.

## Key findings

- MM-HSD outperforms state-of-the-art methods on the HateMM dataset, reaching an M-F1 score of 0.874.
- Performance improves when on-screen text is used as the query and the remaining modalities serve as the key.
- Cross-Modal Attention effectively captures inter-modal dependencies.
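
The M-F1 figure in the first finding is not expanded in this summary. Assuming it denotes macro-averaged F1 (the unweighted mean of per-class F1 scores), a minimal sketch of the metric could look like the following. The function names and the toy labels are hypothetical and are not taken from the paper or its code.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score for one class, treating `cls` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    # Convention used here: F1 is 0 when the class is never true nor predicted.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed meaning of M-F1)."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy binary labels (1 = hate, 0 = non-hate); illustrative only.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
score = macro_f1(y_true, y_pred)
```

Because every class contributes equally to the average, a model that ignores the minority class is penalized even when overall accuracy is high, which is one reason macro-averaging is common for imbalanced detection tasks.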

## Abstract

While hate speech detection (HSD) has been extensively studied in text, existing multi-modal approaches remain limited, particularly in videos. As modalities are not always individually informative, simple fusion methods fail to fully capture inter-modal dependencies. Moreover, previous work often omits relevant modalities such as on-screen text and audio, which may contain subtle hateful content and thus provide essential cues, both individually and in combination with others. In this paper, we present MM-HSD, a multi-modal model for HSD in videos that integrates video frames, audio, and text derived from speech transcripts and from frames (i.e., on-screen text) together with features extracted by Cross-Modal Attention (CMA). We are the first to use CMA as an early feature extractor for HSD in videos, to systematically compare query/key configurations, and to evaluate the interactions between different modalities in the CMA block. Our approach leads to improved performance when on-screen text is used as a query and the rest of the modalities serve as a key. Experiments on the HateMM dataset show that MM-HSD outperforms state-of-the-art methods on M-F1 score (0.874), using concatenation of transcript, audio, video, on-screen text, and CMA for feature extraction on raw embeddings of the modalities. The code is available at https://github.com/idiap/mm-hsd
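
The query/key configuration described in the abstract can be illustrated with plain scaled dot-product attention. This is only a sketch under stated assumptions, not the authors' implementation. It uses a single attention head with no learned projections, reuses the keys as values, and works on toy 4-dimensional embeddings. The paper's actual layer sizes, projections, and encoders are not given in this summary, and all names below are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(query, keys, values):
    """Single-query scaled dot-product attention over other modalities."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    attended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return attended, weights


# Toy per-modality embeddings (assumption: all share one dimension).
on_screen_text = [1.0, 0.0, 0.0, 0.0]
transcript = [1.0, 0.0, 0.0, 0.0]
audio = [0.0, 1.0, 0.0, 0.0]
video = [0.0, 0.0, 1.0, 0.0]

# On-screen text is the query; the remaining modalities act as keys
# (and, in this sketch, also as values).
others = [transcript, audio, video]
cma_features, weights = cross_modal_attention(on_screen_text, others, others)

# Concatenate the raw embeddings with the CMA features before classification.
fused = transcript + audio + video + on_screen_text + cma_features
```

In this toy setup the transcript embedding is closest to the on-screen text query, so it receives the largest attention weight. The fused vector then carries both the raw per-modality embeddings and the attended summary, mirroring the concatenation the abstract describes.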

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20546/full.md

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20546/full.md

## References

74 references — full list in the complete paper: https://tomesphere.com/paper/2508.20546/full.md

---
Source: https://tomesphere.com/paper/2508.20546