Language Modeling for Code-Switched Data: Challenges and Approaches
Ganji Sreeram, Rohit Sinha

TL;DR
This paper addresses the challenges of language modeling for intra-sentential code-switching, introducing a new corpus, POS features, and a novel CS-factor to improve prediction accuracy in bilingual speech applications.
Contribution
It creates a Hindi-English code-switching corpus, explores POS features for modeling, and proposes the CS-factor to enhance language model performance.
Findings
Significant reduction in perplexity with POS features.
Additional reduction in perplexity using the CS-factor.
Improved modeling of intra-sentential code-switching data.
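The findings above are stated in terms of perplexity, so a small worked example may help show what that metric measures. The sketch below is not the authors' model: it assumes a plain add-one smoothed bigram model, and the romanized Hindi-English sentences are invented for illustration rather than taken from the paper's corpus. The function names `train_bigram` and `perplexity` are hypothetical.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences, with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def perplexity(sentences, unigrams, bigrams):
    """Perplexity of the sentences under an add-one smoothed bigram model."""
    vocab_size = len(unigrams)
    log_prob, n_predictions = 0.0, 0
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        for prev, word in zip(padded, padded[1:]):
            # Add-one smoothing keeps unseen bigrams from getting zero probability.
            p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
            log_prob += math.log(p)
            n_predictions += 1
    # Perplexity is the exponential of the average negative log-probability.
    return math.exp(-log_prob / n_predictions)

# Invented toy sentences (assumption: not from the paper's corpus).
train = [
    ["mujhe", "yeh", "movie", "pasand", "hai"],
    ["yeh", "movie", "bahut", "interesting", "hai"],
]
unigrams, bigrams = train_bigram(train)

familiar = [["yeh", "movie", "pasand", "hai"]]
scrambled = [["hai", "pasand", "movie", "yeh"]]
print(perplexity(familiar, unigrams, bigrams))
print(perplexity(scrambled, unigrams, bigrams))
```

A word order the model has seen gets a lower perplexity than a scrambled one, which is the sense in which a lower value indicates a better fit to the data.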
Abstract
Lately, the problem of code-switching has gained a lot of attention and has emerged as an active area of research. In bilingual communities, speakers commonly embed words and phrases of a non-native language into the syntax of a native language in their day-to-day communication. Although code-switching is a global phenomenon among multilingual communities, very limited acoustic and linguistic resources are available as yet. For developing effective speech-based applications, the ability of existing language technologies to deal with code-switched data cannot be overemphasized. Code-switching is broadly classified into two modes: inter-sentential and intra-sentential. In this work, we study the intra-sentential problem in the context of the code-switching language modeling task. The salient contributions of this paper include: (i) the creation…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
