"Hinglish" Language -- Modeling a Messy Code-Mixed Language
Vivek Kumar Gupta

TL;DR
This paper develops a deep learning-based classifier for Hinglish social media content, effectively categorizing it into abusive, hate-inducing, or not offensive, using data augmentation techniques to improve performance.
Contribution
It introduces a novel deep learning approach with text augmentation for classifying Hinglish social content, achieving state-of-the-art results.
Findings
Outperforms previous models on the Hinglish dataset
Effective use of easy text augmentation (synonym replacement, random insertion, random swap, and random deletion)
Achieves high accuracy in classifying offensive content
Abstract
With a sharp rise in the fluency and number of users of "Hinglish" in India, a linguistically diverse country, it has become increasingly important to analyze social content written in this language on platforms such as Twitter, Reddit, and Facebook. This project uses deep learning techniques to classify social content written in Hindi-English into Abusive, Hate-Inducing, and Not offensive categories. We utilize bi-directional sequence models with easy text augmentation techniques such as synonym replacement, random insertion, random swap, and random deletion to produce a state-of-the-art classifier that outperforms previous work on this dataset.
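The four augmentation operations named in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names, the parameters (`n`, `p`), and the tiny `SYNONYMS` lexicon are assumptions made for illustration only. A real pipeline would need a synonym source suited to romanized Hindi-English text, which this summary does not describe.

```python
import random

# Hypothetical toy synonym lexicon (an assumption, not from the paper).
SYNONYMS = {
    "movie": ["film"],
    "accha": ["badhiya"],
}

def synonym_replacement(tokens, synonyms, n, rng):
    """Replace up to n tokens that have an entry in the lexicon."""
    out = list(tokens)
    candidates = [i for i, t in enumerate(out) if t in synonyms]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(synonyms[out[i]])
    return out

def random_insertion(tokens, synonyms, n, rng):
    """Insert a synonym of a random known token at a random position, n times."""
    out = list(tokens)
    candidates = [t for t in tokens if t in synonyms]
    if not candidates:
        return out
    for _ in range(n):
        word = rng.choice(candidates)
        # randint is inclusive, so len(out) means "append at the end".
        out.insert(rng.randint(0, len(out)), rng.choice(synonyms[word]))
    return out

def random_swap(tokens, n, rng):
    """Swap two distinct random positions, n times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]

if __name__ == "__main__":
    rng = random.Random(0)  # seeded so runs are repeatable
    sentence = "yeh movie bahut accha hai".split()
    print(synonym_replacement(sentence, SYNONYMS, 1, rng))
    print(random_insertion(sentence, SYNONYMS, 1, rng))
    print(random_swap(sentence, 1, rng))
    print(random_deletion(sentence, 0.2, rng))
```

Each function returns a new token list and leaves its input unchanged, so several augmented variants can be generated from one labeled example while keeping the original label.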
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Natural Language Processing Techniques
