Evaluating Input Representation for Language Identification in Hindi-English Code Mixed Text
Ramchandra Joshi, Raviraj Joshi

TL;DR
This paper investigates several input representations and deep learning models for language identification in Hindi-English code-mixed text, and finds that sub-word embeddings combined with an LSTM yield the highest accuracy.
Contribution
The paper systematically evaluates input representations and deep learning architectures for language identification in code-mixed text, and highlights the effectiveness of sub-word embeddings with an LSTM.
Findings
Sub-word representations outperform character and word embeddings.
LSTM models achieve higher accuracy than CNN models.
The best model reaches 94.52% accuracy on a standard dataset.
Abstract
Natural language processing (NLP) techniques have become mainstream in the recent decade. Most of these advances are attributed to the processing of a single language. More recently, with the extensive growth of social media platforms, the focus has shifted to code-mixed text. Code-mixed text comprises text written in more than one language. People naturally tend to combine local language with global languages like English. To process such texts, current NLP techniques are not sufficient. As a first step, the text is processed to identify the language of the words in the text. In this work, we focus on language identification in code-mixed sentences for Hindi-English mixed text. The task of language identification is formulated as a token classification task. In the supervised setting, each word in the sentence has an associated language label. We evaluate different deep learning models…
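The token classification formulation described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy sentence, the `hi`/`en` label names, and the helper functions are hypothetical, and character n-grams are used only as a simple stand-in for sub-word units, since this summary does not say which sub-word scheme the authors used. No model is trained; the sketch only shows how each word carries its own language label and how token-level accuracy could be computed.

```python
from typing import List, Tuple

# A labeled sentence is a list of (word, language label) pairs.
# The words and labels below are a hypothetical toy example,
# not taken from the paper's dataset.
Sentence = List[Tuple[str, str]]


def char_ngrams(word: str, n: int = 3) -> List[str]:
    """Split a word into overlapping character n-grams.

    Assumption: n-grams with '<' and '>' boundary markers stand in for
    sub-word units; the paper's actual sub-word scheme may differ.
    """
    padded = f"<{word.lower()}>"
    if len(padded) <= n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def to_token_examples(sentence: Sentence, n: int = 3) -> List[Tuple[List[str], str]]:
    """Turn a labeled sentence into per-token (sub-word units, label) pairs."""
    return [(char_ngrams(word, n), label) for word, label in sentence]


def token_accuracy(gold: List[str], predicted: List[str]) -> float:
    """Fraction of tokens whose predicted label matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Hypothetical code-mixed sentence with one label per word.
sentence: Sentence = [
    ("yeh", "hi"),
    ("movie", "en"),
    ("bahut", "hi"),
    ("achhi", "hi"),
    ("hai", "hi"),
]
examples = to_token_examples(sentence)
```

In a full system, the sub-word units of each token would be mapped to embeddings and passed to a sequence model such as an LSTM, which would emit one label per token; that part is omitted here.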
Taxonomy
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory
