"Hinglish" Language -- Modeling a Messy Code-Mixed Language

Vivek Kumar Gupta

arXiv:1912.13109·cs.CL·January 1, 2020·5 cites

Open Access

TL;DR

This paper develops a deep learning classifier that sorts Hinglish (Hindi-English code-mixed) social media content into three categories: abusive, hate-inducing, and not offensive. It uses text data augmentation to improve performance.

Contribution

It combines bi-directional sequence models with easy text augmentation to classify Hinglish social content, and reports outperforming previous work on the same dataset.

Findings

01 Outperforms previous work on the same Hinglish dataset

02 Uses easy text augmentation (synonym replacement, random insertion, random swap, and random deletion) to expand the training data

03 Achieves high accuracy in classifying offensive content

Abstract

With a sharp rise in the number and fluency of "Hinglish" users in India, a linguistically diverse country, it has become increasingly important to analyze social content written in this language on platforms such as Twitter, Reddit, and Facebook. This project uses deep learning techniques to classify social content written in Hindi-English into three categories: Abusive, Hate-Inducing, and Not Offensive. We combine bi-directional sequence models with easy text augmentation techniques (synonym replacement, random insertion, random swap, and random deletion) to produce a state-of-the-art classifier that outperforms previous work on this dataset.
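
The four augmentation operations named in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the synonym table `TOY_SYNONYMS`, the function names, and the parameters `n` (number of edits) and `p` (deletion probability) are hypothetical choices made for illustration, and the abstract does not say which synonym source or settings were used.

```python
import random

# Hypothetical toy synonym table (assumption): the abstract does not name
# the synonym source used for Hinglish text.
TOY_SYNONYMS = {
    "bad": ["awful", "terrible"],
    "movie": ["film"],
    "good": ["nice", "great"],
}


def synonym_replacement(tokens, n, rng, synonyms=TOY_SYNONYMS):
    """Replace up to n tokens that have an entry in the synonym table."""
    out = list(tokens)
    candidates = [i for i, tok in enumerate(out) if tok in synonyms]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(synonyms[out[i]])
    return out


def random_insertion(tokens, n, rng, synonyms=TOY_SYNONYMS):
    """n times: insert a synonym of a random known token at a random position."""
    out = list(tokens)
    for _ in range(n):
        candidates = [tok for tok in out if tok in synonyms]
        if not candidates:
            break  # nothing in the sentence has a known synonym
        word = rng.choice(candidates)
        out.insert(rng.randint(0, len(out)), rng.choice(synonyms[word]))
    return out


def random_swap(tokens, n, rng):
    """n times: swap the tokens at two distinct random positions."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [tok for tok in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def augment(sentence, seed=0):
    """Return one augmented variant per operation for a whitespace-tokenized sentence."""
    rng = random.Random(seed)
    tokens = sentence.split()
    return {
        "synonym_replacement": " ".join(synonym_replacement(tokens, 1, rng)),
        "random_insertion": " ".join(random_insertion(tokens, 1, rng)),
        "random_swap": " ".join(random_swap(tokens, 1, rng)),
        "random_deletion": " ".join(random_deletion(tokens, 0.2, rng)),
    }


if __name__ == "__main__":
    for name, text in augment("yeh movie bahut bad hai").items():
        print(f"{name}: {text}")
```

Each function returns a new token list and leaves its input unchanged, so the same labeled sentence can yield several variants. In a training pipeline, every variant would keep the original sentence's label.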

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Natural Language Processing Techniques