AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages

Shamsuddeen Hassan Muhammad; Idris Abdulmumin; Abinew Ali Ayele; David Ifeoluwa Adelani; Ibrahim Said Ahmad; Saminu Mohammad Aliyu; Nelson Odhiambo Onyango; Lilian D. A. Wanzare; Samuel Rutunda; Lukman Jibril Aliyu; Esubalew Alemneh; Oumaima Hourrane; Hagos Tesfahun Gebremichael; Elyas Abdi Ismail; Meriem Beloucif; Ebrahim Chekol Jibril; Andiswa Bukula; Rooweither Mabuya; Salomey Osei; Abigail Oppong; Tadesse Destaw Belay; Tadesse Kebede Guge; Tesfa Tegegne Asfaw; Chiamaka Ijeoma Chukwuneke; Paul Röttger; Seid Muhie Yimam; Nedjma Ousidhoum

arXiv:2501.08284 · cs.CL · January 16, 2025



TL;DR

This paper introduces AfriHate, a multilingual dataset of hate speech and abusive language in 15 African languages, created with native speaker annotations to improve understanding and moderation of such content in the region.

Contribution

The paper provides the first high-quality, culturally-aware hate speech datasets for 15 African languages, addressing data scarcity and moderation challenges.

Findings

01 Baseline classification results demonstrate the dataset's utility.

02 Including LLMs improves hate speech detection accuracy.

03 Challenges in dataset construction are discussed.

Abstract

Hate speech and abusive language are global phenomena that need socio-cultural background knowledge to be understood, identified, and moderated. However, in many regions of the Global South, there have been several documented occurrences of (1) absence of moderation and (2) censorship due to the reliance on keyword spotting out of context. Further, high-profile individuals have frequently been at the center of the moderation process, while large and targeted hate speech campaigns against minorities have been overlooked. These limitations are mainly due to the lack of high-quality data in the local languages and the failure to include local communities in the collection, annotation, and moderation processes. To address this issue, we present AfriHate: a multilingual collection of hate speech and abusive language datasets in 15 African languages. Each instance in AfriHate is annotated by…


Code & Models

Repositories

afrihate/afrihate (Official)

Videos

AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages · Underline

Taxonomy

Topics: Hate Speech and Cyberbullying Detection