Arabic Offensive Language on Twitter: Analysis and Experiments

Hamdy Mubarak; Ammar Rashed; Kareem Darwish; Younes Samih; Ahmed Abdelali

arXiv:2004.02192 · cs.CL · March 11, 2021 · 87 citations

Open Access · 1 dataset

TL;DR

This paper builds the largest Arabic offensive tweet dataset to date, analyzes which topics, dialects, and genders are most associated with offensive tweets, and reports strong offensive language detection results (F1 = 83.2) using state-of-the-art techniques.

Contribution

It introduces a method for building a large Arabic offensive tweet dataset that is not biased by topic, dialect, or target, adds special tags for vulgarity and hate speech, and provides a thorough analysis of the data along with strong detection results.

Findings

01

Offensive tweets are more prevalent in certain dialects and topics.

02

The dataset includes detailed tags for vulgarity and hate speech.

03

State-of-the-art models achieve an F1 score of 83.2 on this dataset.

Abstract

Detecting offensive language on Twitter has many applications ranging from detecting/predicting bullying to measuring polarization. In this paper, we focus on building a large Arabic offensive tweet dataset. We introduce a method for building a dataset that is not biased by topic, dialect, or target. We produce the largest Arabic dataset to date with special tags for vulgarity and hate speech. We thoroughly analyze the dataset to determine which topics, dialects, and gender are most associated with offensive tweets and how Arabic speakers use offensive language. Lastly, we conduct many experiments to produce strong results (F1 = 83.2) on the dataset using SOTA techniques.

Code & Models

Datasets

strombergnlp/offenseval_2020
dataset · 405 dl

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Swearing, Euphemism, Multilingualism · Spam and Phishing Detection