MM-HSD: Multi-Modal Hate Speech Detection in Videos

Berta Céspedes-Sarrias; Carlos Collado-Capell; Pablo Rodenas-Ruiz; Olena Hrynenko; Andrea Cavallaro

arXiv:2508.20546 · cs.MM · August 29, 2025

TL;DR

This paper introduces MM-HSD, a multi-modal model for hate speech detection in videos that effectively integrates video, audio, and text modalities using Cross-Modal Attention, achieving state-of-the-art results on the HateMM dataset.

Contribution

The paper is the first to apply Cross-Modal Attention (CMA) as an early feature extractor for multi-modal hate speech detection in videos, and it systematically analyzes the interactions between modalities.

Findings

01. MM-HSD outperforms previous methods on the HateMM dataset, reaching a macro-F1 (M-F1) score of 0.874.

02. Using on-screen text as the query improves detection performance.

03. Cross-Modal Attention effectively captures inter-modal dependencies.

Abstract

While hate speech detection (HSD) has been extensively studied in text, existing multi-modal approaches remain limited, particularly in videos. As modalities are not always individually informative, simple fusion methods fail to fully capture inter-modal dependencies. Moreover, previous work often omits relevant modalities such as on-screen text and audio, which may contain subtle hateful content and thus provide essential cues, both individually and in combination with others. In this paper, we present MM-HSD, a multi-modal model for HSD in videos that integrates video frames, audio, and text derived from speech transcripts and from frames (i.e., on-screen text) together with features extracted by Cross-Modal Attention (CMA). We are the first to use CMA as an early feature extractor for HSD in videos, to systematically compare query/key configurations, and to evaluate the interactions…

Tables

Table 1. HSD in videos. Key – T: text modality (speech transcript), A: audio modality (waveform), V: video modality (frames), O: on-screen text (text in the frame), M: video metadata, n/a: not applicable, MoE: Mixture of Experts, MPL: Multimodal Projection Layer in Llama 3.2-vision, CMA: cross-modal attention, CONCAT: concatenation fusion.

Reference	T	A	V	O	M	Fusion
(Alcântara et al., 2020; Wu and Bhandary, 2020)	∙					n/a
(Kandakatla, 2016)					∙	n/a
(Wang et al., 2024; Das et al., 2023a; Koushik et al., 2025)	∙	∙	∙			CONCAT
(Maity et al., 2024; Koushik et al., 2025)	∙	∙	∙			CMA
(Lang et al., 2025)	∙	∙	∙		∙	MoE
(Wang et al., 2025b)	∙		∙			MPL
(Xiong et al., 2024)	∙	∙	∙	∙		Bimodal CMA
MM-HSD (ours)	∙	∙	∙	∙		CMA and CONCAT

Table 2. Performance comparison of models trained on HateMM (Das et al., 2023a). Bold indicates the best performance and underline the second-best for each metric. We report our results over 5 runs as mean (std). Key – T: transcript, V: video, A: audio, O: on-screen text, M: macro average across both classes, H: hate class, F1: F1-score, ACC: unbiased accuracy, P: precision, R: recall.

Model	Architecture	T	V	A	O	ACC	M-F1	F1(H)	P(H)	R(H)	P(M)	R(M)
(Das et al., 2023a)	BERT, ViT, MFCC	∙	∙	∙		.798	.790	.749	.742	.758	–	–
(Koushik et al., 2025)	HXP, CLAP, CLIP	∙	∙	∙		.854	.848	–	–	–	.840	.800
(Wang et al., 2025b)	LLaMA-3.2-11B	∙	∙			.820	.800	.790	–	–	–	–
(Xiong et al., 2024)	BERT, ViT, wav2vec + OCR + CMA	∙	∙	∙	∙	.849	.840	.876	.857	.896	–	–
MM-HSD (ours)	Detoxify, ViT, wav2vec, OCR + CMA	∙	∙	∙	∙	.878 (.009)	.874 (.009)	.853 (.009)	.849 (.017)	.857 (.000)	.874 (.010)	.875 (.008)
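
The per-class and macro metrics reported in Table 2 follow the usual definitions. The snippet below is a minimal sketch of those definitions, assuming binary labels with 1 for the hate class and 0 for the non-hate class. The function names and the toy labels are illustrative and do not come from the paper.

```python
from typing import Sequence, Tuple


def class_prf(y_true: Sequence[int], y_pred: Sequence[int], cls: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 for one class, treating `cls` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from_precision_recall(precision, recall)


def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], classes=(0, 1)) -> float:
    """Unweighted mean of the per-class F1 scores (M-F1)."""
    return sum(class_prf(y_true, y_pred, c)[2] for c in classes) / len(classes)


# Toy labels (hypothetical): 1 = hate, 0 = non-hate.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(class_prf(y_true, y_pred, cls=1))  # P(H), R(H), F1(H)
print(macro_f1(y_true, y_pred))          # M-F1
```

As a consistency check on the MM-HSD row, `f1_from_precision_recall(0.849, 0.857)` gives about 0.853, which agrees with the reported F1(H). The reported values are means over 5 runs, so this kind of agreement is expected to hold only approximately.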

Table 3. Results for unimodal models, CMA as a standalone feature extractor (CMA-S), CMA as an extra modality (MM-HSD), CMA as late fusion (CMA-LF) and modality-specific models fused with concatenation without CMA (w/o CMA). We report our results over 5 runs as mean (std). Key – T: transcript, V: video, A: audio, O: on-screen text, M: macro average across both classes, H: hate class, F1: F1-score, ACC: unbiased accuracy, P: precision, R: recall.

Model	ACC	M-F1	F1(H)	P(H)	R(H)
T	.820 (.012)	.816 (.012)	.790 (.012)	.765 (.019)	.816 (.009)
O	.636 (.014)	.594 (.011)	.464 (.012)	.596 (.032)	.381 (.016)
A	.784 (.019)	.778 (.018)	.742 (.018)	.739 (.039)	.746 (.030)
V	.761 (.027)	.751 (.024)	.702 (.020)	.730 (.055)	.679 (.017)
CMA-S†	.850 (.006)	.846 (.006)	.820 (.006)	.818 (.016)	.821 (.008)
MM-HSD†	.878 (.009)	.874 (.009)	.853 (.009)	.849 (.017)	.857 (.000)
w/o CMA	.846 (.013)	.842 (.014)	.817 (.019)	.805 (.028)	.832 (.052)
CMA-LF†	.842 (.024)	.837 (.024)	.810 (.028)	.812 (.057)	.813 (.057)

Table 4. Results for CMA as an extra modality (MM-HSD). We report our results over 5 runs as mean (std). Key – Mod: modalities, K: key, Q: query, T: transcript, O: on-screen text, A: audio, V: video, M: macro average across both classes, H: hate class, F1: F1-score, ACC: unbiased accuracy, P: precision, R: recall.

Mod.	K	Q	ACC	M-F1	F1(H)	P(H)	R(H)
TO	T	O	.830 (.006)	.825 (.005)	.796 (.006)	.793 (.014)	.800 (.014)
TA	T	A	.828 (.025)	.823 (.024)	.796 (.023)	.786 (.046)	.806 (.007)
TV	T	V	.841 (.006)	.837 (.006)	.811 (.006)	.799 (.009)	.822 (.007)
OA	A	O	.805 (.028)	.801 (.027)	.774 (.024)	.749 (.048)	.803 (.014)
OV	V	O	.775 (.009)	.768 (.010)	.730 (.014)	.726 (.005)	.733 (.026)
AV	A	V	.808 (.026)	.799 (.030)	.759 (.041)	.788 (.023)	.733 (.065)
TOA	TO	A	.834 (.011)	.830 (.011)	.805 (.014)	.787 (.023)	.825 (.037)
TOV	TV	O	.838 (.019)	.834 (.019)	.807 (.022)	.800 (.035)	.816 (.040)
TVA	TV	A	.849 (.007)	.845 (.007)	.819 (.009)	.811 (.014)	.829 (.021)
OAV	OA	V	.821 (.023)	.815 (.026)	.781 (.035)	.789 (.013)	.775 (.059)
TOAV	TAV	O	.878 (.009)	.874 (.009)	.853 (.009)	.849 (.017)	.857 (.000)

Table 5. Efficiency metrics for unimodal and multimodal models. Key – CMA used as standalone (CMA-S), additional modality (MM-HSD), late fusion (CMA-LF), and removed (w/o CMA), TTE: Train Time per Epoch, TTT: Total Train Time, TT: Test Time, Par: Parameters.

Model	TTE (s)	TTT (s)	TT (s)	# Par (M)	Size (MB)
A	0.540	73.162	0.046	0.147	0.562
T	0.441	65.818	0.058	0.123	0.470
O	0.426	18.075	0.050	0.123	0.470
V	0.462	31.427	0.041	1.279	4.880
CMA-S	1.124	155.917	0.068	2.953	11.266
MM-HSD	1.465	293.022	0.060	4.626	17.648
w/o CMA	0.975	70.013	0.065	1.673	6.381
CMA-LF	1.271	81.223	0.089	1.722	6.570
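
The Size column in Table 5 can be related to the parameter counts with simple arithmetic. The sketch below assumes each parameter is stored as a 32-bit float (4 bytes) and that sizes are expressed in units of 2^20 bytes. Both assumptions are inferred from the numbers and are not stated in the table; the function name is illustrative.

```python
BYTES_PER_FLOAT32 = 4      # assumption: parameters stored as 32-bit floats
BYTES_PER_MIB = 2 ** 20    # assumption: "MB" in the table means 2**20 bytes


def model_size_mib(params_millions: float) -> float:
    """Approximate storage size for a model with the given parameter count (in millions)."""
    return params_millions * 1e6 * BYTES_PER_FLOAT32 / BYTES_PER_MIB


# (parameters in millions, reported size in MB), copied from Table 5.
TABLE_5 = {
    "A": (0.147, 0.562),
    "T": (0.123, 0.470),
    "O": (0.123, 0.470),
    "V": (1.279, 4.880),
    "CMA-S": (2.953, 11.266),
    "MM-HSD": (4.626, 17.648),
    "w/o CMA": (1.673, 6.381),
    "CMA-LF": (1.722, 6.570),
}

for name, (params, reported) in TABLE_5.items():
    estimate = model_size_mib(params)
    print(f"{name:8s} estimated {estimate:7.3f}  reported {reported:7.3f}")
```

Under these assumptions the estimates match the reported sizes to within rounding of the parameter counts, which suggests the size differences between models are driven by parameter count alone.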

Table 6. Detailed multi-modal results for the late fusion, using CMA as a fusion strategy. The features are concatenated if more than one modality is used for a key and value. Key – K: key, Q: query, T: transcript, O: on-screen text, A: audio, V: video, M: macro average across both classes, H: hate class, F1: F1-score, ACC: unbiased accuracy, P: precision, R: recall.

Modality	K	Q	ACC	M-F1	F1(H)	P(H)	R(H)
TO	O	T	0.658	0.630	0.527	0.617	0.460
TO	T	O	0.829	0.825	0.800	0.776	0.825
TA	A	T	0.816	0.808	0.770	0.797	0.746
TA	T	A	0.816	0.811	0.781	0.769	0.794
TV	V	T	0.737	0.733	0.701	0.662	0.746
TV	T	V	0.829	0.825	0.797	0.785	0.810
OA	A	O	0.816	0.812	0.785	0.761	0.810
OA	O	A	0.829	0.820	0.780	0.836	0.730
OV	V	O	0.743	0.735	0.688	0.694	0.683
OV	O	V	0.763	0.752	0.700	0.737	0.667
AV	V	A	0.816	0.807	0.767	0.807	0.730
AV	A	V	0.822	0.812	0.769	0.833	0.714
TOA	OA	T	0.796	0.791	0.760	0.742	0.778
	TA	O	0.829	0.824	0.794	0.794	0.794
	TO	A	0.809	0.807	0.785	0.736	0.841
TOV	OV	T	0.770	0.759	0.711	0.741	0.683
	TV	O	0.855	0.852	0.828	0.815	0.841
	TO	V	0.829	0.824	0.797	0.784	0.809
TVA	VA	T	0.796	0.785	0.735	0.796	0.682
	TV	A	0.835	0.829	0.797	0.817	0.778
	TA	V	0.842	0.839	0.818	0.783	0.857
OAV	AV	O	0.796	0.789	0.752	0.758	0.746
	OV	A	0.809	0.803	0.768	0.774	0.762
	OA	V	0.809	0.805	0.775	0.757	0.794
TOAV	OAV	T	0.789	0.786	0.758	0.725	0.794
	TAV	O	0.882	0.877	0.852	0.881	0.825
	TOV	A	0.882	0.879	0.862	0.836	0.889
	TOA	V	0.842	0.837	0.810	0.810	0.810

Table 7. Detailed results for experiments using CMA as a fusion strategy on raw inputs. Key – K: key, Q: query, T: transcript, O: on-screen text, A: audio, V: video, M: macro average across both classes, H: hate class, F1: F1-score, ACC: unbiased accuracy, P: precision, R: recall.

Modality	K	Q	ACC	M-F1	F1(H)	P(H)	R(H)
TO	O	T	0.645	0.644	0.630	0.554	0.730
TO	T	O	0.829	0.825	0.800	0.776	0.825
TA	A	T	0.809	0.806	0.779	0.750	0.809
TA	T	A	0.829	0.825	0.797	0.785	0.810
TV	V	T	0.743	0.727	0.742	0.650	0.749
TV	T	V	0.836	0.832	0.809	0.779	0.841
OA	A	O	0.783	0.776	0.736	0.742	0.730
OA	O	A	0.632	0.631	0.622	0.541	0.730
OV	V	O	0.783	0.777	0.740	0.746	0.734
OV	O	V	0.651	0.651	0.634	0.561	0.730
AV	V	A	0.770	0.760	0.711	0.741	0.683
AV	A	V	0.796	0.789	0.752	0.758	0.746
TOA	OA	T	0.783	0.781	0.759	0.703	0.825
	TA	O	0.816	0.811	0.781	0.769	0.793
	TO	A	0.816	0.813	0.791	0.746	0.841
TOV	OV	T	0.776	0.770	0.730	0.730	0.730
	TV	O	0.842	0.839	0.818	0.783	0.857
	TO	V	0.829	0.825	0.800	0.776	0.825
TVA	VA	T	0.810	0.807	0.785	0.801	0.841
	TV	A	0.862	0.858	0.835	0.828	0.841
	TA	V	0.816	0.812	0.785	0.761	0.810
OAV	AV	O	0.770	0.769	0.752	0.679	0.841
	OV	A	0.789	0.786	0.761	0.718	0.810
	OA	V	0.809	0.807	0.788	0.730	0.857
TOAV	OAV	T	0.822	0.820	0.797	0.757	0.841
	TAV	O	0.888	0.884	0.864	0.871	0.860
	TOV	A	0.855	0.851	0.825	0.851	0.825
	TOA	V	0.842	0.840	0.824	0.767	0.889

Table 8. Effect of applying stopword removal to transcript and OCR modalities in MM-HSD. Key – M: macro average across both classes, H: hate class, F1: F1-score, ACC: unbiased accuracy, P: precision, R: recall.

Model	ACC	M-F1	F1(H)	P(H)	R(H)
MM-HSD	.878 (.009)	.874 (.009)	.853 (.009)	.849 (.017)	.857 (.000)
MM-HSD (removing stopwords)	.866 (.006)	.862 (.006)	.841 (.006)	.826 (.011)	.857 (.000)

Table 9. Performance when excluding modalities from CMA. OCR is kept as the query modality in all cases. Key – M: macro average across both classes, H: hate class, F1: F1-score.

Modality	M-F1	F1(H)
Audio only	0.870	0.848
Video only	0.864	0.841
Transcript only	0.855	0.832
Audio + Video	0.866	0.845
Audio + Transcript	0.859	0.834
Video + Transcript	0.861	0.837
MM-HSD (A+V+T)	0.874	0.853

Equations

\mathrm{CMA}(K, Q, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V,
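
To make the CMA equation concrete, the following dependency-free sketch computes attention where the query comes from one modality and the key and value come from others. It assumes the standard scaled dot-product form with scaling by the square root of d_k, reuses the key features as the value, and concatenates modalities row-wise to form the key. These choices, the function names, and the toy feature vectors are all assumptions made for illustration and are not the authors' implementation.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(query: Matrix, key: Matrix, value: Matrix) -> Matrix:
    """Attend from query rows (one modality) over key/value rows (other modalities).

    query: n_q rows of length d_k; key: n_k rows of length d_k;
    value: n_k rows of length d_v. Returns n_q rows of length d_v.
    """
    d_k = len(key[0])
    scale = math.sqrt(d_k)  # assumption: standard sqrt(d_k) scaling
    d_v = len(value[0])
    output = []
    for q in query:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in key]
        weights = softmax(scores)
        output.append([sum(w * v[j] for w, v in zip(weights, value)) for j in range(d_v)])
    return output


# Hypothetical 2-dimensional features, already projected to a shared size.
onscreen_text = [[1.0, 0.0], [0.0, 1.0]]   # query (O)
transcript = [[1.0, 0.0]]                  # T
audio = [[0.0, 1.0]]                       # A
video = [[1.0, 1.0]]                       # V
key_value = transcript + audio + video     # row-wise concatenation (assumption)

attended = cross_modal_attention(onscreen_text, key_value, key_value)
print(len(attended), len(attended[0]))     # one output row per query row
```

Each output row is a weighted average of the value rows, so the result has as many rows as the query and as many columns as the value. In a real model the features of each modality would first pass through learned projections to a common dimension, which this sketch omits.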

Full text

MM-HSD: Multi-Modal Hate Speech Detection in Videos

Berta Céspedes-Sarrias

EPFL, Lausanne, Switzerland