Strategies for Language Identification in Code-Mixed Low Resource Languages
Soumil Mandal, Sankalp Sanand

TL;DR
This paper introduces three low-resource strategies for word-level language identification in code-mixed data. The best single system reaches about 91% accuracy, and an ensemble of all three reaches about 92.6%.
Contribution
It presents three low-resource methods for word-level language tagging of code-mixed data, each of which outperforms the authors' baseline model.
Findings
Best single system achieved around 91% accuracy
Ensemble of all three strategies reached around 92.6% accuracy
Each strategy outperformed the authors' baseline model
Abstract
In recent years, substantial work has been done on language tagging of code-mixed data, but most of it uses large amounts of data to build models. In this article, we present three strategies to build a word-level language tagger for code-mixed data using very low resources. Each of them secured an accuracy higher than our baseline model, and the best performing system got an accuracy around 91%. Combining all three, the ensemble system achieved an accuracy of around 92.6%.
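The abstract does not say how the three taggers are combined, so the following is only an illustrative sketch of one common way to ensemble word-level taggers: majority voting with a tie-break. Everything here is an assumption made for illustration, not the authors' method: the three toy taggers (`lexicon_tagger`, `suffix_tagger`, `bigram_tagger`), the label names `"en"` and `"other"`, and the `fallback_index` tie-break parameter are all hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# A tagger maps a single word to a language label.
Tagger = Callable[[str], str]

# Three toy taggers (hypothetical stand-ins, NOT the paper's strategies).
def lexicon_tagger(word: str) -> str:
    """Label a word "en" if it appears in a tiny hand-made word list."""
    return "en" if word.lower() in {"the", "is", "movie", "good"} else "other"

def suffix_tagger(word: str) -> str:
    """Label a word "en" if it ends in a common English suffix."""
    return "en" if word.lower().endswith(("ing", "ed")) else "other"

def bigram_tagger(word: str) -> str:
    """Label a word "en" if it contains the character bigram "th"."""
    return "en" if "th" in word.lower() else "other"

def majority_vote(word: str, taggers: Sequence[Tagger], fallback_index: int = 0) -> str:
    """Return the label most taggers agree on.

    On a tie for first place, fall back to the tagger at `fallback_index`
    (an assumed tie-break rule).
    """
    votes = [tagger(word) for tagger in taggers]
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return votes[fallback_index]
    return ranked[0][0]

def tag_sentence(sentence: str, taggers: Sequence[Tagger]) -> List[Tuple[str, str]]:
    """Tag every whitespace-separated token in a sentence."""
    return [(word, majority_vote(word, taggers)) for word in sentence.split()]

TAGGERS = [lexicon_tagger, suffix_tagger, bigram_tagger]

if __name__ == "__main__":
    print(tag_sentence("the khub", TAGGERS))
```

With three taggers and two labels a tie cannot occur, so the tie-break only matters when the number of taggers is even or more than two labels are in play. A weighted vote, using each tagger's held-out accuracy as its weight, would be a natural alternative under the same assumptions.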
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
