A Simple and Efficient Probabilistic Language model for Code-Mixed Text

M Zeeshan Ansari; Tanvir Ahmad; M M Sufyan Beg; Asma Ikram

arXiv:2106.15102·cs.CL·June 30, 2021

Open Access

TL;DR

This paper introduces a simple probabilistic word embedding method tailored for code-mixed text, demonstrating improved language identification accuracy on Hindi-English Twitter data using machine learning classifiers.

Contribution

The paper proposes a novel probabilistic approach for creating efficient word embeddings specifically for code-mixed language identification tasks.

Findings

01 Improved accuracy over existing code-mixed embeddings

02 Effective use with bidirectional LSTMs and SVM classifiers

03 Demonstrated on Hindi-English Twitter data

Abstract

Conventional natural language processing approaches are not well suited to social media text because of its colloquial discourse and non-homogeneous characteristics. Notably, language identification in a multilingual document is a prerequisite subtask in several information extraction applications such as information retrieval, named entity recognition, and relation extraction. The problem is often more challenging in code-mixed documents, wherein foreign-language words are drawn into the base language while framing the text. Word embeddings are powerful language modeling tools for representing text documents and are useful for obtaining similarity between words or documents. We present a simple probabilistic approach for building efficient word embeddings for code-mixed text and exemplify it on language identification of Hindi-English short text messages…
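To make the idea of a probabilistic word representation for language identification concrete, here is a minimal sketch. It is not the authors' method, which the abstract does not spell out. It assumes a small set of language-tagged tokens and represents each word by its smoothed per-language probabilities; the function names, the toy tokens, and the add-alpha smoothing are all illustrative assumptions.

```python
from collections import Counter, defaultdict


def train_language_probs(tagged_tokens, languages=("hi", "en"), alpha=1.0):
    """Return an embed(word) function giving smoothed P(language | word).

    Assumption: each word is represented by one probability per language,
    estimated from tagged token counts with add-alpha smoothing.
    """
    counts = defaultdict(Counter)
    for word, lang in tagged_tokens:
        counts[word.lower()][lang] += 1

    def embed(word):
        # Unseen words get an empty Counter, hence a uniform vector.
        c = counts.get(word.lower(), Counter())
        total = sum(c[lang] for lang in languages) + alpha * len(languages)
        return [(c[lang] + alpha) / total for lang in languages]

    return embed


def predict_language(embed, word, languages=("hi", "en")):
    """Pick the language with the highest estimated probability."""
    vec = embed(word)
    return languages[vec.index(max(vec))]


# Toy Hindi-English tokens, invented purely for illustration.
toy_tokens = [
    ("yeh", "hi"), ("movie", "en"), ("bahut", "hi"),
    ("hai", "hi"), ("movie", "en"), ("hai", "hi"), ("the", "en"),
]
embed = train_language_probs(toy_tokens)
movie_vec = embed("movie")      # ordered as (P(hi), P(en))
unseen_vec = embed("zzz")       # word never seen in training
movie_lang = predict_language(embed, "movie")
```

In a fuller pipeline, vectors of this kind could serve as input features to a classifier such as an SVM or a bidirectional LSTM, the two classifier families named in the findings above.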

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification