Toward Generalized Cross-Lingual Hateful Language Detection with Web-Scale Data and Ensemble LLM Annotations

Dang H. Dang; Jelena Mitrović; Michael Granitzer

arXiv:2604.09625·cs.CL·April 14, 2026

TL;DR

This paper explores how web-scale unlabelled data and ensemble LLM annotations can enhance multilingual hate speech detection, especially benefiting smaller models and low-resource languages.

Contribution

It demonstrates that combining web data with LLM-generated synthetic labels improves hate speech detection, particularly for small models and low-resource languages.

Findings

01

Continued pre-training on web data improves macro-F1 by ~3%.

02

Ensemble LLM annotations boost small-model performance by 11% F1.

03

LightGBM ensemble outperforms other synthetic annotation strategies.

Abstract

We study whether large-scale unlabelled web data and LLM-based synthetic annotations can improve multilingual hate speech detection. Starting from texts crawled via OpenWebSearch.eu (OWS) in four languages (English, German, Spanish, Vietnamese), we pursue two complementary strategies. First, we apply continued pre-training to BERT models by continuing masked language modelling on unlabelled OWS texts before supervised fine-tuning, and show that this yields an average macro-F1 gain of approximately 3% over standard baselines across sixteen benchmarks, with stronger gains in low-resource settings. Second, we use four open-source LLMs (Mistral-7B, Llama3.1-8B, Gemma2-9B, Qwen2.5-14B) to produce synthetic annotations through three ensemble strategies: mean averaging, majority voting, and a LightGBM meta-learner. The LightGBM ensemble consistently outperforms the other strategies.…
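To make the difference between the two simpler ensemble strategies concrete, here is a minimal Python sketch of mean averaging and majority voting over four per-model scores. It is an illustration under stated assumptions, not the authors' implementation: the probability values, the 0.5 threshold, the tie-breaking rule, and the function names are all hypothetical. The LightGBM meta-learner, which would instead be trained on the per-model scores as features, is not sketched here.

```python
from statistics import mean

# Hypothetical probabilities that one text is hateful, one per LLM.
# These values are made up for illustration; they are not from the paper.
scores = {
    "Mistral-7B": 0.9,
    "Llama3.1-8B": 0.4,
    "Gemma2-9B": 0.4,
    "Qwen2.5-14B": 0.4,
}


def mean_average_label(probs, threshold=0.5):
    """Average the model probabilities, then threshold the mean."""
    return int(mean(probs) >= threshold)


def majority_vote_label(probs, threshold=0.5):
    """Threshold each model separately, then take the strict majority.

    Ties resolve to 0 (not hateful); that choice is an assumption.
    """
    votes = [int(p >= threshold) for p in probs]
    return int(sum(votes) * 2 > len(votes))


probs = list(scores.values())
mean_label = mean_average_label(probs)    # one confident model lifts the mean above 0.5
vote_label = majority_vote_label(probs)   # three of four models vote "not hateful"
print(mean_label, vote_label)
```

With these made-up scores the two strategies disagree: a single confident model pulls the average over the threshold, while the vote follows the three less confident models. A learned meta-learner can weigh the models differently instead of relying on either fixed rule.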
