Hate Speech Detection in Roman Urdu
Moin Khan, Khurram Shahzad, Kamran Malik

TL;DR
This paper introduces a manually annotated dataset of Roman Urdu tweets and compares supervised classifiers for detecting hate speech in this low-resource language.
Contribution
The study builds what is, to the authors' knowledge, the first Roman Urdu hate speech corpus and compares multiple supervised learning methods, with Logistic Regression performing best.
Findings
Logistic Regression achieved an F1 score of 0.906 for Neutral-Hostile classification.
The corpus contains 5,000 manually annotated Roman Urdu tweets.
Deep learning techniques did not outperform traditional methods in this context.
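For readers unfamiliar with the metric, the F1 score reported above is the harmonic mean of precision and recall. The sketch below shows one way to compute it for a two-class Neutral-Hostile task. It rests on assumptions not stated in this summary: "Hostile" is treated as the positive class, and the score is the plain binary F1 (the paper may report a macro or weighted average instead). The function name `f1_score` and the toy labels are hypothetical and are not taken from the paper's corpus or results.

```python
def f1_score(y_true, y_pred, positive="Hostile"):
    """Binary F1 for the `positive` class (assumed here to be "Hostile")."""
    pairs = list(zip(y_true, y_pred))
    # Count true positives, false positives, and false negatives.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No correct positive predictions: F1 is defined as 0 here.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Made-up gold labels and predictions, for illustration only.
y_true = ["Hostile", "Hostile", "Hostile", "Neutral",
          "Neutral", "Neutral", "Hostile", "Neutral"]
y_pred = ["Hostile", "Hostile", "Neutral", "Neutral",
          "Hostile", "Neutral", "Hostile", "Neutral"]

score = f1_score(y_true, y_pred)
print(f"F1 (Hostile as positive): {score:.3f}")
```

On this toy data there are 3 true positives, 1 false positive, and 1 false negative, so precision and recall are both 0.75 and the F1 score is 0.75.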
Abstract
Hate speech is a specific type of controversial content that is widely legislated as a crime that must be identified and blocked. However, due to the sheer volume and velocity of the Twitter data stream, hate speech detection cannot be performed manually. To address this issue, several studies have been conducted on hate speech detection in European languages, whereas little attention has been paid to low-resource South Asian languages, leaving social media unsafe for millions of users. In particular, to the best of our knowledge, no study has been conducted on hate speech detection in Roman Urdu text, which is widely used in the sub-continent. In this study, we have scraped more than 90,000 tweets and manually parsed them to identify 5,000 Roman Urdu tweets. Subsequently, we have employed an iterative approach to develop guidelines and used them for generating the Hate Speech…
Taxonomy
Methods: Logistic Regression
