OPSD: an Offensive Persian Social media Dataset and its baseline evaluations
Mehran Safayani, Amir Sartipi, Amir Hossein Ahmadi, Parniyan Jalali, Amir Hossein Mansouri, Mohammad Bisheh-Niasar, Zahra Pourbahman

TL;DR
This paper introduces two Persian offensive language datasets, one annotated by experts and one unlabeled, and evaluates baseline performance of modern language models on offensive speech detection in Persian social media.
Contribution
It provides an expert-annotated Persian offensive language dataset, a large unlabeled companion corpus, and baseline evaluations using state-of-the-art models, filling a significant resource gap.
Findings
XLM-RoBERTa achieved 76.9% F1-score on three-class classification.
XLM-RoBERTa achieved 89.9% F1-score on binary classification.
The datasets enable future research in Persian offensive language detection.
Abstract
Hate speech and offensive comments on social media have become increasingly prevalent due to user activities. Such comments can have detrimental effects on individuals' psychological well-being and social behavior. While numerous English-language datasets exist in this domain, few equivalent resources are available for the Persian language. To address this gap, this paper introduces two offensive language datasets. The first comprises annotations provided by domain experts, while the second is a large collection of unlabeled data obtained through web crawling for unsupervised learning purposes. To ensure the quality of the annotated dataset, a meticulous three-stage labeling process was conducted, and kappa measures were computed to assess inter-annotator agreement. Furthermore, experiments were performed on the dataset using state-of-the-art language models,…
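The abstract mentions kappa measures for inter-annotator agreement without saying which variant was used. As a rough illustration only, the sketch below computes Cohen's kappa for two annotators, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance. The choice of Cohen's kappa, the function name cohen_kappa, and the label names are assumptions made for this example and are not taken from the paper, which may use a different measure (for instance one suited to more than two annotators).

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Hypothetical helper for illustration; not the paper's code.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)

    # p_o: fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # p_e: chance agreement, from each annotator's own label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels for six comments (not data from OPSD).
annotator_1 = ["offensive", "neutral", "neutral", "offensive", "neutral", "neutral"]
annotator_2 = ["offensive", "neutral", "offensive", "offensive", "neutral", "neutral"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.3f}")
```

In this toy example the annotators agree on five of six comments (p_o = 5/6) and chance agreement is 0.5, so kappa is about 0.667. Values near 1 indicate agreement well beyond chance, while values near 0 indicate agreement no better than chance.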
Taxonomy
Topics: Terrorism, Counterterrorism, and Political Violence · Hate Speech and Cyberbullying Detection · Network Security and Intrusion Detection
