Offensive Language Identification in Low-resourced Code-mixed Dravidian languages using Pseudo-labeling
Adeep Hande, Karthik Puranik, Konthala Yasaswini, Ruba Priyadharshini, Sajeetha Thavareesan, Anbukkarasi Sampath, Kogilavani Shanmugavadivel, Durairaj Thenmozhi, Bharathi Raja Chakravarthi

TL;DR
This paper presents a method to identify offensive language in low-resource, code-mixed Dravidian languages by creating a pseudo-labeled dataset and fine-tuning pretrained language models, achieving improved classification performance.
Contribution
It introduces a novel pseudo-labeling approach and dataset construction for under-resourced Dravidian languages, enhancing offensive language detection accuracy.
Findings
ULMFiT fine-tuning achieved the best results
Weighted F1-Score of 0.7934 on Tamil-English
Competitive scores on Malayalam-English and Kannada-English
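For readers unfamiliar with the metric reported above, the weighted F1-score is the per-class F1 averaged with each class weighted by its share of the true labels. The following is a minimal sketch of that standard definition, not the paper's evaluation code; the label names and the toy predictions are illustrative assumptions.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score

# Toy example with hypothetical labels (not data from the paper).
y_true = ["off", "off", "not", "not", "not", "not"]
y_pred = ["off", "not", "not", "not", "not", "off"]
print(round(weighted_f1(y_true, y_pred), 4))
```

Because the weights follow class frequency, a model can reach a high weighted F1 on an imbalanced dataset while still performing poorly on rare offensive classes.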
Abstract
Social media has effectively become the prime hub of communication and digital marketing. As these platforms enable the free manifestation of thoughts and facts in text, images and video, there is an extensive need to screen them to protect individuals and groups from offensive content targeted at them. Our work intends to classify code-mixed social media comments/posts in the Dravidian languages of Tamil, Kannada, and Malayalam. We intend to improve offensive language identification by generating pseudo-labels on the dataset. A custom dataset is constructed by transliterating all the code-mixed texts into the respective Dravidian language (Kannada, Malayalam, or Tamil) and then generating pseudo-labels for the transliterated dataset. The two datasets are combined using the generated pseudo-labels to create a custom dataset called CMTRA. As Dravidian languages are under-resourced,…
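The dataset construction described in the abstract (transliterate, pseudo-label, combine) can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the authors' pipeline: the abstract does not say which model produces the pseudo-labels, so the sketch assumes a classifier trained on the original labeled text. The `transliterate` stub, its one-word lookup table, the tiny naive Bayes model, and the label names are all hypothetical stand-ins.

```python
import math
from collections import Counter, defaultdict

# Hypothetical one-entry lookup standing in for a real transliteration tool.
TOY_TRANSLIT = {"padam": "படம்"}

def transliterate(text):
    """Toy stub: swap known romanized tokens for native-script forms."""
    return " ".join(TOY_TRANSLIT.get(tok, tok) for tok in text.split())

def train_nb(texts, labels):
    """Fit a tiny multinomial naive Bayes model over whitespace tokens."""
    word_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.split())
    vocab = {tok for counts in word_counts.values() for tok in counts}
    return dict(word_counts), Counter(labels), vocab

def predict_nb(model, text):
    """Return the most likely label; tokens outside the vocabulary are skipped."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for tok in text.split():
            if tok in vocab:
                score += math.log((word_counts[label][tok] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

def build_combined(labeled):
    """Original (text, gold label) pairs plus (transliterated text, pseudo-label) pairs."""
    texts, labels = zip(*labeled)
    model = train_nb(texts, labels)  # assumption: trained on the original data
    pseudo = []
    for text in texts:
        native = transliterate(text)
        pseudo.append((native, predict_nb(model, native)))
    return list(labeled) + pseudo

# Hypothetical code-mixed examples, not taken from the paper's dataset.
labeled = [
    ("padam super good", "not_offensive"),
    ("song super nice", "not_offensive"),
    ("padam waste bad", "offensive"),
    ("actor waste worst", "offensive"),
]
combined = build_combined(labeled)
for text, label in combined[len(labeled):]:
    print(label, "|", text)
```

In this toy, the pseudo-labels rely on English tokens that survive transliteration unchanged. A real system would more plausibly use a multilingual pretrained model that handles both scripts.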
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Swearing, Euphemism, Multilingualism
Methods: Dropout · Embedding Dropout · Activation Regularization · Sigmoid Activation · Tanh Activation · Weight Tying · Temporal Activation Regularization · DropConnect · Long Short-Term Memory · Variational Dropout
