Language Modeling for Code-Switched Data: Challenges and Approaches

Ganji Sreeram; Rohit Sinha

arXiv:1711.03541 · cs.CL · November 13, 2017 · 5 cites

Open Access

TL;DR

This paper addresses language modeling for intra-sentential code-switching. It introduces a new Hindi-English corpus, explores part-of-speech (POS) features, and proposes a novel CS-factor to improve prediction accuracy in bilingual speech applications.

Contribution

It creates a Hindi-English code-switching corpus, explores POS features for modeling, and proposes the CS-factor to enhance language model performance.

Findings

01. Significant reduction in perplexity with POS features.

02. Further reduction in perplexity with the CS-factor.

03. Improved modeling of intra-sentential code-switching data.
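
The findings above are stated in terms of perplexity, where lower is better. As a rough illustration only, the sketch below computes perplexity from the probabilities a language model assigns to each token of a test sentence. The function name `perplexity` and the probability values are hypothetical and chosen for illustration; they are not taken from the paper, and the paper's own evaluation setup may differ.

```python
import math


def perplexity(token_probs):
    """Perplexity of a token sequence, given the model probability of each token.

    Defined as exp(-(1/N) * sum(log p_i)), the inverse geometric mean of the
    per-token probabilities.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / len(token_probs))


# Made-up per-token probabilities for one sentence under two hypothetical
# models; these numbers are illustrative, not results from the paper.
baseline_probs = [0.10, 0.05, 0.20, 0.10]
factored_probs = [0.20, 0.10, 0.20, 0.20]

if __name__ == "__main__":
    print("baseline perplexity:", perplexity(baseline_probs))
    print("factored perplexity:", perplexity(factored_probs))
```

A model that assigns higher probability to the observed tokens, including the tokens at switch points, yields a lower perplexity, which is the sense in which the findings report a reduction.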

Abstract

Code-switching has recently gained considerable attention and has emerged as an active area of research. In bilingual communities, speakers commonly embed words and phrases of a non-native language into the syntax of a native language in their day-to-day communication. Although code-switching is a global phenomenon among multilingual communities, very limited acoustic and linguistic resources are available so far. For developing effective speech-based applications, the ability of existing language technologies to deal with code-switched data cannot be overemphasized. Code-switching is broadly classified into two modes: inter-sentential and intra-sentential. In this work, we study the intra-sentential problem in the context of the code-switching language modeling task. The salient contributions of this paper include: (i) the creation…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems