Leveraging Language Identification to Enhance Code-Mixed Text Classification

Gauri Takawane; Abhishek Phaltankar; Varad Patwardhan; Aryan Patil; Raviraj Joshi; Mukta S. Takalikar

arXiv:2306.04964·cs.CL·June 9, 2023·2 cites

Open Access

TL;DR

This paper enhances BERT-based models for code-mixed Hindi-English text classification by incorporating language identification and augmentation techniques, significantly improving performance on various downstream tasks.

Contribution

It introduces a novel pipeline with language augmentation methods, including word-level interleaving and post-sentence placement, to improve code-mixed text classification.
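
The two augmentation schemes named above can be illustrated with a minimal sketch. The tag strings ("hi", "en"), the "[SEP]" separator, and the function names are assumptions made for illustration; the paper's exact input format may differ.

```python
from typing import List


def interleave_tags(words: List[str], tags: List[str]) -> str:
    """Word-level interleaving: each word is followed by its language tag."""
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    return " ".join(f"{word} {tag}" for word, tag in zip(words, tags))


def append_tags(words: List[str], tags: List[str], sep: str = "[SEP]") -> str:
    """Post-sentence placement: the tag sequence follows the whole sentence.

    The separator token is an assumption, not taken from the paper.
    """
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    return " ".join(words) + f" {sep} " + " ".join(tags)


# Hypothetical Hindi-English example with hand-assigned tags.
words = ["yeh", "movie", "bahut", "achhi", "thi"]
tags = ["hi", "en", "hi", "hi", "hi"]

interleaved = interleave_tags(words, tags)
post_sentence = append_tags(words, tags)
print(interleaved)
print(post_sentence)
```

Either string could then be passed to a BERT tokenizer in place of the raw sentence, so the classifier sees the language information as ordinary input tokens.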

Findings

1. Language augmentation improves model accuracy across tasks
2. Augmented models outperform vanilla BERT models
3. Effective on multiple downstream datasets

Abstract

The use of more than one language in the same text is referred to as code-mixing. The adoption of code-mixed data, especially English mixed with a regional language, is growing on social media platforms. Existing deep-learning models do not take advantage of the implicit language information in code-mixed text. Our study aims to improve the performance of BERT-based models on low-resource code-mixed Hindi-English datasets by experimenting with language augmentation approaches. We propose a pipeline to improve code-mixed systems that comprises data preprocessing, word-level language identification, language augmentation, and model training on downstream tasks like sentiment analysis. For language augmentation in BERT models, we explore word-level interleaving and post-sentence placement of language information. We have examined the performance of…
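
The first two pipeline stages described in the abstract, preprocessing and word-level language identification, might look roughly like the sketch below. The cleaning rules and the tiny `ENGLISH_WORDS` lexicon are hypothetical stand-ins: the abstract does not specify the preprocessing steps, and a real system would use a trained language identification model rather than a lookup table.

```python
import re
from typing import List, Tuple

# Hypothetical toy lexicon standing in for a trained word-level
# language identification model.
ENGLISH_WORDS = {"movie", "the", "was", "good", "very"}


def preprocess(text: str) -> List[str]:
    """Lowercase, drop URLs, mentions and hashtags, keep alphabetic words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)       # remove mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)      # remove punctuation and digits
    return text.split()


def identify_languages(words: List[str]) -> List[Tuple[str, str]]:
    """Tag each word 'en' if it is in the toy lexicon, otherwise 'hi'."""
    return [(w, "en" if w in ENGLISH_WORDS else "hi") for w in words]


tagged = identify_languages(
    preprocess("Yeh MOVIE bahut achhi thi!! @user https://t.co/x")
)
print(tagged)
```

The resulting (word, tag) pairs are the input that the language augmentation stage would consume before model training.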

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Natural Language Processing Techniques · Interpreting and Communication in Healthcare

Methods: Multi-Head Attention · Attention Is All You Need · Dropout · Linear Layer · Attention Dropout · Linear Warmup With Linear Decay · Residual Connection · Layer Normalization · Adam