Language Identification of Bengali-English Code-Mixed data using Character & Phonetic based LSTM Models

Soumil Mandal; Sourya Dipta Das; Dipankar Das

arXiv:1803.03859·cs.CL·June 28, 2018

TL;DR

This paper introduces LSTM-based models for word-level language identification in Bengali-English social media text. The models handle code-mixing and inconsistent phonetic transliteration, and two ensembles built from them reach 91.78% and 92.35% accuracy.

Contribution

It proposes character-based and root-phone-based word encodings for training deep LSTM models, and combines the two models through stacking and threshold ensembles for word-level language identification in low-resource code-mixed data.

Findings

01

Achieved 91.78% accuracy with the stacking ensemble and 92.35% with the threshold ensemble on the test data.

02

Demonstrated the effectiveness of character-based and root-phone-based word encodings as inputs to the LSTM models.

03

Addressed challenges of code-mixing and transliteration in social media text.

Abstract

Language identification of social media text remains a challenging task due to properties like code-mixing and inconsistent phonetic transliterations. In this paper, we present a supervised learning approach for word-level language identification of low-resource Bengali-English code-mixed data taken from social media. We employ two methods of word encoding, character-based and root-phone-based, to train our deep LSTM models. Using these two models, we created two ensemble models, one with stacking and one with a threshold technique, which gave 91.78% and 92.35% accuracy respectively on our test data.
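
As a rough illustration of the character-based word encoding and the threshold ensemble mentioned in the abstract, here is a minimal Python sketch. The abstract does not specify the index scheme, the padding length, or the threshold rule, so `build_char_vocab`, `encode_word`, `threshold_ensemble`, `max_len=12`, and `threshold=0.8` are all hypothetical choices for illustration and should not be read as the authors' implementation.

```python
from typing import Dict, List

PAD = 0  # index reserved for padding (assumption)
UNK = 1  # index reserved for characters not seen when building the vocabulary (assumption)


def build_char_vocab(words: List[str]) -> Dict[str, int]:
    """Map each distinct lowercase character to an integer index starting at 2."""
    chars = sorted({ch for w in words for ch in w.lower()})
    return {ch: i + 2 for i, ch in enumerate(chars)}


def encode_word(word: str, vocab: Dict[str, int], max_len: int = 12) -> List[int]:
    """Turn a word into a fixed-length list of character indices, as an LSTM input would need."""
    ids = [vocab.get(ch, UNK) for ch in word.lower()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


def threshold_ensemble(p_char: float, p_phone: float, threshold: float = 0.8) -> str:
    """Hypothetical threshold rule over two models' P(word is Bengali).

    If the character model is confident either way, use its probability alone;
    otherwise average it with the phonetic model's probability.
    """
    if p_char >= threshold or p_char <= 1 - threshold:
        p = p_char
    else:
        p = (p_char + p_phone) / 2
    return "bn" if p >= 0.5 else "en"
```

For example, with a vocabulary built from `["ami", "good"]`, `encode_word("ami", vocab, max_len=5)` yields three character indices followed by two padding zeros. A stacking ensemble would instead train a second-level classifier on the two models' output probabilities rather than apply a fixed rule.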

Taxonomy

Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory