A Simple and Efficient Probabilistic Language model for Code-Mixed Text
M Zeeshan Ansari, Tanvir Ahmad, M M Sufyan Beg, Asma Ikram

TL;DR
This paper introduces a simple probabilistic word embedding method tailored for code-mixed text, demonstrating improved language identification accuracy on Hindi-English Twitter data using machine learning classifiers.
Contribution
The paper proposes a novel probabilistic approach for creating efficient word embeddings specifically for code-mixed language identification tasks.
Findings
Improved accuracy over existing code-mixed embeddings
Effective use with bidirectional LSTMs and SVM classifiers
Demonstrated on Hindi-English Twitter data
Abstract
Conventional natural language processing approaches are poorly suited to social media text because of its colloquial discourse and non-homogeneous characteristics. Language identification in a multilingual document is a prerequisite subtask in several information extraction applications such as information retrieval, named entity recognition, and relation extraction. The problem is often more challenging in code-mixed documents, in which foreign-language words are drawn into the base language while framing the text. Word embeddings are powerful language modeling tools for representing text documents and are useful for measuring similarity between words or documents. We present a simple probabilistic approach for building efficient word embeddings for code-mixed text and exemplify it on language identification of Hindi-English short text messages…
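The truncated abstract does not spell out how the embedding is built, so the following is only an illustrative sketch of what a probabilistic word representation for language identification could look like, not the authors' method. It assumes a small set of word-level language labels and represents each word by a smoothed estimate of P(language | word). The function names, the `alpha` smoothing parameter, and the toy tokens are all hypothetical.

```python
from collections import Counter, defaultdict

def train_language_probabilities(labeled_tokens, languages, alpha=1.0):
    """Return a function mapping a word to [P(lang | word) for lang in languages].

    Assumption: add-alpha (Laplace) smoothing over word-level language counts.
    This is an illustrative choice, not the method described in the paper.
    """
    counts = defaultdict(Counter)
    for word, lang in labeled_tokens:
        counts[word.lower()][lang] += 1

    def embed(word):
        # Unseen words get an empty Counter, which yields a uniform vector.
        c = counts.get(word.lower(), Counter())
        total = sum(c.values()) + alpha * len(languages)
        return [(c[lang] + alpha) / total for lang in languages]

    return embed

def identify(word, embed, languages):
    """Pick the language with the highest estimated probability for the word."""
    vec = embed(word)
    best = max(range(len(languages)), key=vec.__getitem__)
    return languages[best]

# Hypothetical toy data: romanized Hindi ("hi") and English ("en") tokens.
LANGUAGES = ["hi", "en"]
toy_tokens = [
    ("mujhe", "hi"), ("yeh", "hi"), ("movie", "en"), ("bahut", "hi"),
    ("pasand", "hi"), ("movie", "en"), ("good", "en"), ("hai", "hi"),
]
embed = train_language_probabilities(toy_tokens, LANGUAGES)
```

With this toy data, `embed("movie")` gives more weight to English than to Hindi, and an unseen word falls back to a uniform vector. In practice such per-word vectors would be passed as features to a downstream classifier (the summary mentions SVMs and bidirectional LSTMs) rather than thresholded directly.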
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
