A multilingual dataset for offensive language and hate speech detection for Hausa, Yoruba and Igbo languages
Saminu Mohammad Aliyu, Gregory Maksha Wajiga, Muhammad Murtala

TL;DR
This paper introduces new annotated datasets for offensive language detection in Hausa, Yoruba, and Igbo, built from Twitter data labeled by native speakers. Pre-trained language models evaluated on these datasets reach up to 90% accuracy, and the resources are intended to support multilingual offensive language detection research.
Contribution
It presents the first multilingual datasets for offensive language detection in Hausa, Yoruba, and Igbo, along with an evaluation of pre-trained language models on these datasets.
Findings
The best-performing pre-trained model achieved 90% accuracy.
The authors plan to release the datasets and models publicly.
Pre-trained language models proved effective at detecting offensive language in the Nigerian languages studied.
Abstract
The proliferation of online offensive language necessitates the development of effective detection mechanisms, especially in multilingual contexts. This study addresses the challenge by developing and introducing novel datasets for offensive language detection in three major Nigerian languages: Hausa, Yoruba, and Igbo. We collected data from Twitter and manually annotated it to create datasets for each of the three languages, using native speakers. We used pre-trained language models to evaluate their efficacy in detecting offensive language in our datasets. The best-performing model achieved an accuracy of 90%. To further support research in offensive language detection, we plan to make the dataset and our models publicly available.
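The accuracy figure quoted above is the share of posts whose predicted label matches the annotators' label. As a minimal sketch of how such a score could be computed separately for each language, the following assumes a hypothetical record layout of (language, gold label, predicted label) and hypothetical label names ("offensive", "normal"); neither is taken from the paper, and the rows are made-up toy values, not the authors' data or results.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Return {language: accuracy} for (language, gold, predicted) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for language, gold, predicted in records:
        total[language] += 1
        correct[language] += int(gold == predicted)
    return {language: correct[language] / total[language] for language in total}


# Made-up toy rows for illustration only (not data from the paper).
toy_records = [
    ("hausa", "offensive", "offensive"),
    ("hausa", "normal", "offensive"),
    ("yoruba", "normal", "normal"),
    ("igbo", "offensive", "offensive"),
]

scores = accuracy_by_language(toy_records)
for language, score in scores.items():
    print(f"{language}: {score:.2f}")
```

Reporting accuracy per language rather than pooled keeps a strong result in one language from hiding a weak one in another.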
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Swearing, Euphemism, Multilingualism
