Leveraging Language Identification to Enhance Code-Mixed Text Classification
Gauri Takawane, Abhishek Phaltankar, Varad Patwardhan, Aryan Patil, Raviraj Joshi, Mukta S. Takalikar

TL;DR
This paper enhances BERT-based models for code-mixed Hindi-English text classification by incorporating language identification and augmentation techniques, significantly improving performance on various downstream tasks.
Contribution
It introduces a novel pipeline with language augmentation methods, including word-level interleaving and post-sentence placement, to improve code-mixed text classification.
Findings
Language augmentation improves model accuracy across tasks
Augmented models outperform vanilla BERT models
Effective on multiple downstream datasets
Abstract
The use of more than one language in the same text is referred to as code-mixing. The use of code-mixed data, especially English mixed with a regional language, is evidently growing on social media platforms. Existing deep-learning models do not take advantage of the implicit language information in code-mixed text. Our study aims to improve the performance of BERT-based models on low-resource code-mixed Hindi-English datasets by experimenting with language augmentation approaches. We propose a pipeline to improve code-mixed systems that comprises data preprocessing, word-level language identification, language augmentation, and model training on downstream tasks like sentiment analysis. For language augmentation in BERT models, we explore word-level interleaving and post-sentence placement of language information. We have examined the performance of…
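The two augmentation strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tag labels (`EN`, `HI`, `UNK`), the `[SEP]` separator, the toy lexicon, and all function names are assumptions made for illustration, since the summary does not specify the exact input format. A real pipeline would use a trained word-level language identifier in place of the dictionary lookup.

```python
from typing import Dict, List

# Hypothetical toy lexicon standing in for a trained word-level
# language identifier; the tag labels are assumptions.
TOY_LEXICON: Dict[str, str] = {
    "movie": "EN",
    "was": "EN",
    "bahut": "HI",
    "accha": "HI",
    "tha": "HI",
}


def identify_languages(tokens: List[str]) -> List[str]:
    """Assign one language tag per token via dictionary lookup."""
    return [TOY_LEXICON.get(tok.lower(), "UNK") for tok in tokens]


def interleave(tokens: List[str], tags: List[str]) -> str:
    """Word-level interleaving: each word is followed by its language tag."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    return " ".join(f"{tok} {tag}" for tok, tag in zip(tokens, tags))


def post_sentence(tokens: List[str], tags: List[str], sep: str = "[SEP]") -> str:
    """Post-sentence placement: the tag sequence is appended after the text.

    The separator string is an assumption, not taken from the paper.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    return f"{' '.join(tokens)} {sep} {' '.join(tags)}"


if __name__ == "__main__":
    words = "movie bahut accha tha".split()
    langs = identify_languages(words)
    print(interleave(words, langs))
    print(post_sentence(words, langs))
```

Either augmented string would then be tokenized and passed to a BERT-based classifier in place of the raw sentence. Interleaving keeps each tag next to its word, while post-sentence placement leaves the original word order untouched.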
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Natural Language Processing Techniques · Interpreting and Communication in Healthcare
Methods: Multi-Head Attention · Attention Is All You Need · Dropout · Linear Layer · Attention Dropout · Linear Warmup With Linear Decay · Residual Connection · Layer Normalization · Adam
