Training a code-switching language model with monolingual data
Shun-Po Chuang, Tzu-Wei Sung, Hung-Yi Lee

TL;DR
This paper introduces a method for training code-switching language models on monolingual data only. Constraining and normalizing the output projection matrix improves performance to a level comparable to, or better than, models trained on artificially generated code-switching data.
Contribution
The paper presents an approach for training code-switching language models solely on monolingual data: constraining and normalizing the output projection matrix of an RNN-based language model brings the embeddings of different languages closer to each other.
Findings
Improved performance of CS language models trained on monolingual data.
Comparable or better results than models trained on artificially generated CS data.
Unsupervised bilingual word translation is used to analyze whether semantically equivalent words in different languages are mapped together.
Abstract
A lack of code-switching data complicates the training of code-switching (CS) language models. We propose an approach to train such CS language models on monolingual data only. By constraining and normalizing the output projection matrix in RNN-based language models, we bring embeddings of different languages closer to each other. Numerical and visualization results show that the proposed approaches remarkably improve the performance of CS language models trained on monolingual data. The proposed approaches are comparable to or even better than training CS language models with artificially generated CS data. We additionally use unsupervised bilingual word translation to analyze whether semantically equivalent words in different languages are mapped together.
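The abstract does not spell out how the output projection matrix is constrained or normalized, so the following is only a minimal sketch under stated assumptions. It treats the rows of the matrix as word embeddings, splits them into two language blocks, rescales every row to unit length (one plausible reading of "normalizing"), and computes the cosine distance between the two blocks' mean vectors as a penalty that could be added to the language-model loss (one plausible reading of "constraining"). The function names, the block layout, and the choice of penalty are all hypothetical and are not taken from the paper.

```python
import numpy as np


def normalize_rows(W, eps=1e-12):
    """Rescale each row (one word embedding) to unit L2 norm."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W / np.maximum(norms, eps)


def mean_cosine_distance(W_a, W_b, eps=1e-12):
    """Cosine distance between the mean embeddings of two language blocks.

    Assumed penalty: 0 when the mean vectors point the same way,
    2 when they point in opposite directions.
    """
    mu_a = W_a.mean(axis=0)
    mu_b = W_b.mean(axis=0)
    denom = max(np.linalg.norm(mu_a) * np.linalg.norm(mu_b), eps)
    return 1.0 - float(mu_a @ mu_b) / denom


# Toy output projection matrix: 5 words of language A, 7 words of language B.
rng = np.random.default_rng(0)
hidden_size = 8
W_lang_a = rng.normal(loc=1.0, size=(5, hidden_size))
W_lang_b = rng.normal(loc=-1.0, size=(7, hidden_size))
W = np.vstack([W_lang_a, W_lang_b])

# Normalization step (assumed): every word vector gets unit length.
W_unit = normalize_rows(W)

# Constraint step (assumed): penalty to add to the training loss.
penalty = mean_cosine_distance(W_unit[:5], W_unit[5:])
print("penalty:", penalty)
```

In an actual training loop the penalty would be computed with a differentiable framework and weighted against the usual cross-entropy loss; NumPy is used here only to keep the illustration self-contained.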
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques
