Offensive Language Identification in Low-resourced Code-mixed Dravidian languages using Pseudo-labeling

Adeep Hande; Karthik Puranik; Konthala Yasaswini; Ruba Priyadharshini; Sajeetha Thavareesan; Anbukkarasi Sampath; Kogilavani Shanmugavadivel; Durairaj Thenmozhi; Bharathi Raja Chakravarthi

arXiv:2108.12177·cs.CL·August 30, 2021·5 cites

TL;DR

This paper presents a method to identify offensive language in low-resource, code-mixed Dravidian languages by creating a pseudo-labeled dataset and fine-tuning pretrained language models, achieving improved classification performance.

Contribution

It introduces a novel pseudo-labeling approach and dataset construction for under-resourced Dravidian languages, enhancing offensive language detection accuracy.

Findings

01 ULMFiT fine-tuning achieved best results
02 Weighted F1-Score of 0.7934 on Tamil-English
03 Competitive scores on Malayalam-English and Kannada-English
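
For readers unfamiliar with the metric in the findings, a weighted F1-score averages the per-class F1 values, weighting each class by its share of the true labels. The sketch below is a plain-Python illustration of that standard definition; the function name `weighted_f1` and the toy labels are hypothetical and are not taken from the paper or its data.

```python
from collections import Counter
from typing import Sequence


def weighted_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        score += (n / total) * f1  # weight by class frequency
    return score


# Toy labels, invented for illustration only.
toy_true = ["off", "off", "not", "not", "not", "not"]
toy_pred = ["off", "not", "not", "not", "not", "off"]
toy_score = weighted_f1(toy_true, toy_pred)
```

Because each class is weighted by its frequency, a majority class such as non-offensive comments dominates the score on imbalanced data.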

Abstract

Social media has effectively become the prime hub of communication and digital marketing. As these platforms enable the free manifestation of thoughts and facts in text, images and video, there is an extensive need to screen them to protect individuals and groups from offensive content targeted at them. Our work intends to classify code-mixed social media comments/posts in the Dravidian languages of Tamil, Kannada, and Malayalam. We intend to improve offensive language identification by generating pseudo-labels on the dataset. A custom dataset is constructed by transliterating all the code-mixed texts into the respective Dravidian language, either Kannada, Malayalam, or Tamil, and then generating pseudo-labels for the transliterated dataset. The two datasets are combined using the generated pseudo-labels to create a custom dataset called CMTRA. As Dravidian languages are under-resourced,…
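
The pipeline described in the abstract (transliterate, pseudo-label, combine) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_combined_dataset`, `toy_transliterate`, and `toy_predict` are hypothetical names, the transliteration table is a two-word toy, and the keyword rule stands in for whatever model produces the pseudo-labels, which the truncated abstract does not specify.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (text, label)


def build_combined_dataset(
    labeled: List[Example],
    transliterate: Callable[[str], str],
    predict_label: Callable[[str], str],
) -> List[Example]:
    """Append pseudo-labeled transliterations to the original labeled data."""
    # Step 1: transliterate every code-mixed text into native script.
    transliterated = [transliterate(text) for text, _ in labeled]
    # Step 2: assign a pseudo-label to each transliterated text.
    pseudo_labeled = [(text, predict_label(text)) for text in transliterated]
    # Step 3: combine the original and pseudo-labeled sets.
    return list(labeled) + pseudo_labeled


# Hypothetical stand-ins: a two-word Tamil lookup table and a keyword rule.
TOY_MAP = {"nalla": "நல்ல", "padam": "படம்"}


def toy_transliterate(text: str) -> str:
    # Replace known Latin-script tokens; leave everything else unchanged.
    return " ".join(TOY_MAP.get(tok, tok) for tok in text.split())


def toy_predict(text: str) -> str:
    # Placeholder classifier: flags a dummy token instead of real slurs.
    return "Offensive" if "BADWORD" in text.split() else "Not_offensive"


toy_labeled = [("nalla padam", "Not_offensive"), ("BADWORD padam", "Offensive")]
toy_combined = build_combined_dataset(toy_labeled, toy_transliterate, toy_predict)
```

In practice the two callables would be replaced by a real transliteration tool and a trained classifier; passing them as arguments keeps the combining step independent of either choice.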

Code & Models

Repositories

adeeph/dravidian-oli · PyTorch · Official

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Swearing, Euphemism, Multilingualism

Methods: Dropout · Embedding Dropout · Activation Regularization · Sigmoid Activation · Tanh Activation · Weight Tying · Temporal Activation Regularization · DropConnect · Long Short-Term Memory · Variational Dropout