Data-Efficient Hate Speech Detection via Cross-Lingual Nearest Neighbor Retrieval with Limited Labeled Data

Faeze Ghorbanpour; Daryna Dementieva; Alexander Fraser

arXiv:2505.14272·cs.CL·May 27, 2025

TL;DR

This paper presents a data-efficient cross-lingual hate speech detection method that uses nearest-neighbor retrieval to augment limited labeled data, outperforming existing models across eight languages.

Contribution

It introduces a scalable, retrieval-based approach that enhances hate speech detection with minimal labeled data, surpassing state-of-the-art performance in low-resource settings.

Findings

01. Outperforms models trained only on target-language data

02. Requires as few as 200 retrieved instances for effective performance

03. Scalable and adaptable to new languages and tasks

Abstract

Although detecting hateful language is important, labeled hate speech data is expensive and time-consuming to collect, particularly for low-resource languages. Prior work has demonstrated the effectiveness of cross-lingual transfer learning and data augmentation in improving performance on tasks with limited labeled data. To develop an efficient and scalable cross-lingual transfer learning approach, we leverage nearest-neighbor retrieval to augment minimal labeled data in the target language, thereby enhancing detection performance. Specifically, we assume access to a small set of labeled training instances in the target language and use these to retrieve the most relevant labeled examples from a large multilingual hate speech detection pool. We evaluate our approach on eight languages and demonstrate that it consistently outperforms models trained solely on the target language…
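
The retrieval step described in the abstract can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes every sentence has already been embedded by some multilingual encoder (the encoder is not specified here), it uses cosine similarity as the relevance measure, and the names `retrieve_neighbors`, `k_per_query`, and `max_total` are hypothetical.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def retrieve_neighbors(target_emb, pool_emb, k_per_query, max_total=None):
    """Return indices of pool items that are near any target-language example.

    target_emb: (n_target, d) embeddings of the small labeled target set.
    pool_emb:   (n_pool, d) embeddings of the multilingual labeled pool.
    k_per_query: neighbors retrieved per target example (assumed parameter).
    max_total:  optional cap on the number of retrieved items (assumed parameter).
    """
    t = l2_normalize(np.asarray(target_emb, dtype=float))
    p = l2_normalize(np.asarray(pool_emb, dtype=float))
    sims = t @ p.T  # cosine similarities, shape (n_target, n_pool)

    k = min(k_per_query, p.shape[0])
    top = np.argsort(-sims, axis=1)[:, :k]  # top-k pool indices per query

    # Deduplicate: keep the best similarity each pool item reached for any query.
    best = {}
    for row, idxs in enumerate(top):
        for j in idxs:
            j = int(j)
            s = float(sims[row, j])
            if j not in best or s > best[j]:
                best[j] = s

    # Rank retrieved items by their best similarity, most similar first.
    ranked = sorted(best, key=lambda j: -best[j])
    if max_total is not None:
        ranked = ranked[:max_total]
    return np.array(ranked, dtype=int)


def build_augmented_set(target_examples, pool_examples, retrieved_idx):
    """Concatenate the small target set with the retrieved labeled pool items."""
    return list(target_examples) + [pool_examples[i] for i in retrieved_idx]
```

A classifier would then be trained on the output of `build_augmented_set` rather than on the target-language examples alone. Deduplicating and capping the retrieved items is one possible design choice for keeping the augmented set small; the source reports that as few as 200 retrieved instances can be effective, but does not describe the selection procedure in this excerpt.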

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training