A multilingual dataset for offensive language and hate speech detection for Hausa, Yoruba and Igbo languages

Saminu Mohammad Aliyu; Gregory Maksha Wajiga; Muhammad Murtala

arXiv:2406.02169·cs.CL·June 7, 2024·1 cites

Open Access

TL;DR

This paper introduces new manually annotated datasets for offensive language detection in Hausa, Yoruba, and Igbo, built from Twitter data. Pre-trained language models evaluated on these datasets reach up to 90% accuracy. The work is intended to support research on multilingual offensive language detection.

Contribution

The paper presents novel datasets for offensive language detection in Hausa, Yoruba, and Igbo, along with an evaluation of pre-trained language models on these datasets.

Findings

01 The best-performing model achieved an accuracy of 90%.

02 The authors plan to make the dataset and models publicly available.

03 Pre-trained language models can detect offensive language in the three Nigerian languages studied.

Abstract

The proliferation of online offensive language necessitates the development of effective detection mechanisms, especially in multilingual contexts. This study addresses the challenge by developing and introducing novel datasets for offensive language detection in three major Nigerian languages: Hausa, Yoruba, and Igbo. We collected data from Twitter and manually annotated it to create datasets for each of the three languages, using native speakers. We used pre-trained language models to evaluate their efficacy in detecting offensive language in our datasets. The best-performing model achieved an accuracy of 90%. To further support research in offensive language detection, we plan to make the dataset and our models publicly available.
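
The abstract reports accuracy as its headline metric. As a minimal sketch of how such a figure could be computed per language, the Python below scores predictions against gold labels. The label names ("offensive", "normal"), the row layout, and the toy rows are all assumptions for illustration; they are not taken from the paper's datasets or its annotation scheme.

```python
from collections import defaultdict


def accuracy(gold, pred):
    """Fraction of examples whose predicted label equals the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("cannot score an empty label list")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def accuracy_by_language(rows):
    """Group (language, gold_label, predicted_label) rows and score each group."""
    grouped = defaultdict(lambda: ([], []))
    for lang, gold_label, pred_label in rows:
        grouped[lang][0].append(gold_label)
        grouped[lang][1].append(pred_label)
    return {lang: accuracy(g, p) for lang, (g, p) in grouped.items()}


# Hypothetical toy rows, not real data from the paper.
rows = [
    ("hausa", "offensive", "offensive"),
    ("hausa", "normal", "normal"),
    ("yoruba", "offensive", "normal"),
    ("yoruba", "normal", "normal"),
    ("igbo", "normal", "normal"),
    ("igbo", "offensive", "offensive"),
]

scores = accuracy_by_language(rows)
for lang, score in sorted(scores.items()):
    print(f"{lang}: {score:.2f}")
```

Scoring each language separately keeps a strong result in one language from hiding a weak result in another, which a single pooled accuracy could do.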

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Swearing, Euphemism, Multilingualism