Strategies for Language Identification in Code-Mixed Low Resource Languages

Soumil Mandal; Sankalp Sanand

arXiv:1810.07156 · cs.CL · November 2, 2018

TL;DR

This paper introduces three low-resource strategies for word-level language identification in code-mixed data. The best single system reaches around 91% accuracy, and an ensemble of all three reaches around 92.6%.

Contribution

It presents three low-resource strategies for word-level language tagging of code-mixed data, each of which outperforms the authors' baseline model.

Findings

01

Best single system achieved around 91% accuracy

02

Ensemble of all three strategies reached around 92.6% accuracy

03

Each of the three strategies outperformed the authors' baseline model

Abstract

In recent years, substantial work has been done on language tagging of code-mixed data, but most of it relies on large amounts of data to build the models. In this article, we present three strategies for building a word-level language tagger for code-mixed data using very low resources. Each strategy achieved a higher accuracy than our baseline model, and the best-performing system reached an accuracy of around 91%. Combining all three, the ensemble system achieved an accuracy of around 92.6%.
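The abstract does not say how the three taggers are combined, so the following is only a minimal sketch of one common option, a per-word majority vote. The three toy taggers, their word lists, and the labels `EN` and `OTHER` are hypothetical stand-ins for illustration, not the systems described in the paper.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# A tagger maps a single word to a language label.
Tagger = Callable[[str], str]

# Hypothetical toy taggers; none of these is taken from the paper.
EN_WORDS = {"the", "movie", "is", "good"}
OTHER_WORDS = {"bahut", "accha"}


def lexicon_tagger(word: str) -> str:
    """Label a word EN only if it appears in a tiny English word list."""
    return "EN" if word.lower() in EN_WORDS else "OTHER"


def suffix_tagger(word: str) -> str:
    """Label a word EN if it ends in a common English suffix."""
    return "EN" if word.lower().endswith(("ing", "ed", "ly")) else "OTHER"


def other_lexicon_tagger(word: str) -> str:
    """Label a word OTHER only if it appears in a tiny non-English word list."""
    return "OTHER" if word.lower() in OTHER_WORDS else "EN"


def majority_vote(word: str, taggers: Sequence[Tagger]) -> str:
    """Return the most frequent label; ties go to the earliest tagger's label."""
    votes = [tagger(word) for tagger in taggers]
    counts = Counter(votes)
    best = max(counts.values())
    for vote in votes:
        if counts[vote] == best:
            return vote
    raise ValueError("at least one tagger is required")


def tag_sentence(words: Sequence[str], taggers: Sequence[Tagger]) -> List[Tuple[str, str]]:
    """Tag every word in a sentence with the ensemble label."""
    return [(word, majority_vote(word, taggers)) for word in words]


TAGGERS = [lexicon_tagger, suffix_tagger, other_lexicon_tagger]

if __name__ == "__main__":
    print(tag_sentence(["movie", "bahut", "amazing"], TAGGERS))
```

A vote like this can correct a word that one tagger gets wrong when the other two agree, which is one plausible reason an ensemble might score above its best single member.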

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification