An Annotated Corpus of Arabic Tweets for Hate Speech Analysis
Wajdi Zaghouani, Md. Rafiul Biswas

TL;DR
This paper presents a new annotated corpus of 10,000 Arabic tweets for hate speech detection, including multilabel annotations for various hate targets, and evaluates transformer models on this dataset.
Contribution
It introduces a comprehensive, annotated Arabic hate speech dataset with multilabel targets and provides baseline transformer model performance.
Findings
Inter-annotator agreement of 0.86 for offensive content
AraBERTv2 achieved a micro-F1 score of 0.7865
Dataset enables improved hate speech analysis in Arabic
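This summary does not name the agreement statistic behind the 0.86 figure. As an illustration only, the sketch below assumes Cohen's kappa between two annotators on the binary offensive / not-offensive label. The function name `cohen_kappa` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies,
    # summed over all labels.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in labels) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data (made up): 1 = offensive, 0 = not offensive.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0]
annotator_2 = [1, 1, 0, 0, 1, 0, 0, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it can be noticeably lower than the plain percentage of matching labels.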
Abstract
Identifying hate speech in Arabic is challenging because of the language's rich dialectal variation. This study introduces a multilabel hate speech dataset for Arabic. We collected 10,000 Arabic tweets and annotated each one as containing offensive content or not. If a tweet contains offensive content, we further classify it by hate speech target: religion, gender, politics, ethnicity, origin, or others. A tweet can have a single target or multiple targets. Multiple annotators took part in the annotation task. The inter-annotator agreement was 0.86 for offensive content and 0.71 for multiple hate speech targets. Finally, we evaluated the annotated data with several transformer-based models, among which AraBERTv2 performed best, with a micro-F1 score of 0.7865 and an…
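Because a tweet can carry several targets at once, micro-F1 pools true positives, false positives, and false negatives over all (tweet, target) pairs before computing a single score. The following is a minimal sketch of that calculation, assuming each tweet's targets are represented as a set of label names. The function `micro_f1` and the example tweets are hypothetical, and this is not the authors' evaluation code.

```python
# Target names as listed in the abstract.
TARGETS = ["religion", "gender", "politics", "ethnicity", "origin", "others"]


def micro_f1(gold, predicted):
    """Micro-averaged F1 over multilabel annotations given as sets of labels."""
    tp = fp = fn = 0
    for gold_labels, pred_labels in zip(gold, predicted):
        gold_labels, pred_labels = set(gold_labels), set(pred_labels)
        tp += len(gold_labels & pred_labels)  # targets correctly predicted
        fp += len(pred_labels - gold_labels)  # predicted but not annotated
        fn += len(gold_labels - pred_labels)  # annotated but missed
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Toy data (made up): four tweets, one with two gold targets, one with none.
gold = [{"religion"}, {"gender", "politics"}, set(), {"ethnicity"}]
pred = [{"religion"}, {"gender"}, {"origin"}, {"ethnicity"}]
score = micro_f1(gold, pred)
print(score)
```

Micro-averaging weights every label decision equally, so frequent targets dominate the score; a macro average over `TARGETS` would instead weight each target equally.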
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Sentiment Analysis and Opinion Mining · Spam and Phishing Detection
