Language Identification of Bengali-English Code-Mixed data using Character & Phonetic based LSTM Models
Soumil Mandal, Sourya Dipta Das, Dipankar Das

TL;DR
This paper introduces LSTM-based models for word-level language identification in Bengali-English code-mixed social media text. The models handle code-mixing and inconsistent phonetic transliteration, and the best ensemble reaches 92.35% accuracy.
Contribution
It proposes character-based and root-phone-based word encodings for training deep LSTM models, and combines the two models through stacking and threshold ensembles for language identification in low-resource code-mixed data.
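As a rough illustration of what a character-based word encoding can look like, the sketch below maps each word to a fixed-length sequence of integer character IDs that an LSTM embedding layer could consume. The vocabulary, the padding and unknown-character IDs, the `max_len` value, and the name `encode_word` are all assumptions made for this example; the summary does not specify the paper's actual encoding scheme.

```python
from typing import List

# Hypothetical vocabulary: lowercase Latin letters only.
# ID 0 is reserved for padding, ID 1 for any character outside the vocabulary.
PAD, UNK = 0, 1
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}


def encode_word(word: str, max_len: int = 12) -> List[int]:
    """Encode a word as a fixed-length list of character IDs.

    Words longer than max_len are truncated; shorter words are right-padded.
    """
    ids = [CHAR_TO_ID.get(ch, UNK) for ch in word.lower()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


if __name__ == "__main__":
    # A romanized Bengali word and an English word, encoded the same way.
    print(encode_word("ami", max_len=5))
    print(encode_word("good", max_len=5))
```

Fixed-length integer sequences are a common input format for sequence models; a root-phone-based encoding would follow the same pattern with a phone inventory in place of the character vocabulary.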
Findings
Achieved 91.78% accuracy with the stacking ensemble and 92.35% with the threshold ensemble on the test data.
Demonstrated effectiveness of phonetic and character-based encodings.
Addressed challenges of code-mixing and transliteration in social media text.
Abstract
Language identification of social media text remains a challenging task due to properties like code-mixing and inconsistent phonetic transliterations. In this paper, we present a supervised learning approach for word-level language identification of low-resource Bengali-English code-mixed data taken from social media. We employ two methods of word encoding, namely character-based and root-phone-based, to train our deep LSTM models. Utilizing these two models, we created two ensemble models using stacking and threshold techniques, which gave 91.78% and 92.35% accuracy, respectively, on our testing data.
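The abstract does not spell out how the threshold ensemble combines the two models. The following is a minimal sketch of one plausible reading, not the authors' method: trust the character model when it is confident, and otherwise average its output with the phonetic model's. The function name `threshold_ensemble`, the labels `"bn"` and `"en"`, the default threshold of 0.8, and the averaging fallback are all assumptions for illustration.

```python
def threshold_ensemble(p_char: float, p_phone: float, threshold: float = 0.8) -> str:
    """Combine two per-word probabilities that a word is Bengali.

    p_char  : probability from a (hypothetical) character-based model
    p_phone : probability from a (hypothetical) phonetic-based model
    Returns "bn" for Bengali or "en" for English.
    """
    # Confidence of the character model in whichever class it prefers.
    conf_char = max(p_char, 1.0 - p_char)
    if conf_char >= threshold:
        # Confident enough: use the character model alone.
        p = p_char
    else:
        # Assumed fallback: average the two models' probabilities.
        p = (p_char + p_phone) / 2.0
    return "bn" if p >= 0.5 else "en"


if __name__ == "__main__":
    print(threshold_ensemble(0.95, 0.10))  # character model is confident
    print(threshold_ensemble(0.60, 0.20))  # falls back to the average
```

A stacking ensemble would instead feed both models' outputs into a second-stage classifier trained on held-out data, which requires a fitted model rather than a fixed rule.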
Taxonomy
Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
