Hate Speech Detection in Roman Urdu

Moin Khan; Khurram Shahzad; Kamran Malik

arXiv:2108.02830·cs.CL·August 9, 2021

TL;DR

This paper introduces a manually annotated dataset of Roman Urdu tweets and evaluates supervised classifiers for detecting hate speech in this low-resource language.

Contribution

The study builds what the authors describe as the first Roman Urdu hate speech corpus and compares multiple supervised learning methods, finding that Logistic Regression performs best.

Findings

01

Logistic Regression achieved an F1 score of 0.906 for Neutral-Hostile classification.

02

The corpus contains 5,000 manually annotated Roman Urdu tweets.

03

Deep learning techniques did not outperform traditional machine learning methods on this corpus.
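
For readers unfamiliar with the metric behind the first finding, the following is a minimal sketch of how an F1 score is computed for a two-class Neutral-Hostile task. The function name `f1_for_class` and the toy labels are assumptions made for illustration; they are not taken from the paper or its corpus, and the summary does not state which averaging scheme the reported 0.906 uses, so this sketch computes F1 for a single class only.

```python
def f1_for_class(gold, predicted, positive="Hostile"):
    """Return the F1 score of one class, given parallel lists of labels."""
    pairs = list(zip(gold, predicted))
    # True positives: gold and prediction both equal the positive class.
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    # False positives: predicted positive, but the gold label differs.
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    # False negatives: gold is positive, but the prediction differs.
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy labels, invented for this example only (not data from the paper).
gold = ["Hostile", "Neutral", "Hostile", "Hostile", "Neutral", "Neutral"]
predicted = ["Hostile", "Neutral", "Neutral", "Hostile", "Hostile", "Neutral"]

score = f1_for_class(gold, predicted)
print(f"F1 (Hostile) = {score:.3f}")
```

Because F1 balances precision against recall, it penalizes a classifier that flags everything as hostile as well as one that flags almost nothing, which is why it is commonly preferred over plain accuracy for this kind of task.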

Abstract

Hate speech is a specific type of controversial content that is widely legislated as a crime that must be identified and blocked. However, due to the sheer volume and velocity of the Twitter data stream, hate speech detection cannot be performed manually. To address this issue, several studies have been conducted for hate speech detection in European languages, whereas little attention has been paid to low-resource South Asian languages, leaving social media vulnerable for millions of users. In particular, to the best of our knowledge, no study has been conducted for hate speech detection in Roman Urdu text, which is widely used in the sub-continent. In this study, we have scraped more than 90,000 tweets and manually parsed them to identify 5,000 Roman Urdu tweets. Subsequently, we have employed an iterative approach to develop guidelines and used them for generating the Hate Speech…

Taxonomy

Methods: Logistic Regression