Abusive and Threatening Language Detection in Urdu using Boosting based and BERT based models: A Comparative Approach

Mithun Das; Somnath Banerjee; Punyajoy Saha

arXiv:2111.14830 · cs.CL · December 1, 2021 · 5 cites

TL;DR

This paper compares boosting-based and BERT-based models for detecting abusive and threatening language in Urdu. It finds that a Transformer model trained on Arabic data performs best, placing first in the shared task.

Contribution

It introduces a comparative analysis of machine learning models for Urdu abusive language detection, highlighting the effectiveness of Transformer models trained on related languages.

Findings

01 Transformer model trained on Arabic data outperforms others

02 Achieved first place in shared task for abusive language detection

03 F1 scores of 0.88 for abusive and 0.54 for threatening content
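For readers unfamiliar with the metric behind these scores, the following is a minimal sketch of how an F1 score is computed for a binary task such as abusive vs. non-abusive. The labels are invented toy data and the function name `f1_score` is illustrative. This does not reproduce the paper's evaluation, and the exact F1 variant used by the shared task is not stated on this page.

```python
def f1_score(y_true, y_pred, positive=1):
    """Return the F1 score of the `positive` class for two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    if tp == 0:
        # No correct positive predictions: F1 is defined as 0 here.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels (1 = abusive, 0 = not abusive); not taken from the dataset.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
score = f1_score(y_true, y_pred)
print(round(score, 2))
```

Because F1 combines precision and recall, it is less forgiving than accuracy when the positive class is rare, which may be one reason the threatening-content score is much lower than the abusive-content score.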

Abstract

Online hatred is a growing concern on many social media platforms. To address this issue, different social media platforms have introduced moderation policies for such content. They also employ moderators who can check the posts violating moderation policies and take appropriate action. Academicians in the abusive language research domain also perform various studies to detect such content better. Although there is extensive research on abusive language detection in English, there is a lacuna in abusive language detection for low-resource languages such as Hindi and Urdu. In the FIRE 2021 shared task "HASOC - Abusive and Threatening language detection in Urdu", the organizers propose an abusive language detection dataset in Urdu along with threatening language detection. In this paper, we explored several machine learning models such as XGBoost, LGBM, and m-BERT based models for abusive and…

Code & Models

Repositories

hate-alert/urduabuseandthreat
none · Official

Taxonomy

Topics: Hate Speech and Cyberbullying Detection

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Label Smoothing · Dense Connections · Absolute Position Encodings · Softmax · Residual Connection