Learn to Code-Switch: Data Augmentation using Copy Mechanism on Language Modeling
Genta Indra Winata, Andrea Madotto, Chien-Sheng Wu, Pascale Fung

TL;DR
This paper introduces a data augmentation method for code-switching language models: a Seq2Seq model combined with a pointer network generates realistic code-switching sentences from parallel corpora, and training on these sentences improves language model perplexity.
Contribution
The work presents a new technique combining Seq2Seq and pointer networks to generate code-switching data without relying on linguistic constraints, enhancing data diversity.
Findings
Perplexity improves by 10% over the LSTM baseline when the language model is trained on the augmented sentences
Method effectively generates grammatical code-switching sentences
Outperforms baseline LSTM language model
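As a rough illustration of how a perplexity comparison like the one above is usually computed, the sketch below takes per-token log-probabilities from a language model and turns them into a perplexity score, then measures the reduction against a baseline. The function names and the toy numbers are hypothetical, and the source does not say whether the 10% figure is a relative or an absolute change; the relative reading is an assumption here.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def relative_reduction(baseline_ppl, new_ppl):
    """Fraction by which new_ppl is lower than baseline_ppl (assumed reading of '10%')."""
    return (baseline_ppl - new_ppl) / baseline_ppl

# Hypothetical per-token log-probabilities on the same held-out text,
# not values from the paper.
baseline_log_probs = [math.log(p) for p in (0.10, 0.05, 0.20, 0.08)]
augmented_log_probs = [math.log(p) for p in (0.12, 0.06, 0.22, 0.09)]

baseline_ppl = perplexity(baseline_log_probs)
augmented_ppl = perplexity(augmented_log_probs)
print(f"baseline: {baseline_ppl:.2f}, augmented: {augmented_ppl:.2f}")
print(f"relative reduction: {relative_reduction(baseline_ppl, augmented_ppl):.1%}")
```

Lower perplexity means the model assigns higher probability to the held-out text, so a positive reduction indicates an improvement.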
Abstract
Building large-scale datasets for training code-switching language models is challenging and very expensive. To alleviate this problem, using a parallel corpus has been a major workaround. However, existing solutions use linguistic constraints, which may not capture the real data distribution. In this work, we propose a novel method for learning how to generate code-switching sentences from parallel corpora. Our model uses a Seq2Seq model in combination with pointer networks to align and choose words from the monolingual sentences and form a grammatical code-switching sentence. In our experiment, we show that by training a language model using the augmented sentences, we improve the perplexity score by 10% compared to the LSTM baseline.
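The copy step described in the abstract can be sketched as a single decoding step that mixes two distributions: one over a fixed vocabulary and one over the positions of the input sentence pair. The sketch below assumes a pointer-generator style mixture with a scalar gate `p_gen`; the abstract does not give the exact formula, so the mixture, the function names, and the toy tokens and scores are all assumptions for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def copy_mixture(vocab, vocab_logits, source_tokens, attention_scores, p_gen):
    """Mix a generation distribution with a copy (pointer) distribution.

    Assumed form: P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of the
    attention weights on source positions holding w.
    """
    p_vocab = softmax(vocab_logits)
    attention = softmax(attention_scores)
    final = {word: p_gen * p for word, p in zip(vocab, p_vocab)}
    for token, weight in zip(source_tokens, attention):
        # Source words missing from the vocabulary can still be produced by copying.
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final

# Hypothetical toy step: the input is a concatenated parallel sentence pair
# (romanized placeholder tokens for the second language).
vocab = ["i", "want", "to", "eat", "<unk>"]
vocab_logits = [0.1, 0.2, 0.3, 1.0, 0.0]
source_tokens = ["i", "want", "to", "eat", "wo", "xiang", "chi"]
attention_scores = [0.0, 0.1, 0.0, 0.5, 0.0, 0.2, 2.0]

dist = copy_mixture(vocab, vocab_logits, source_tokens, attention_scores, p_gen=0.3)
next_word = max(dist, key=dist.get)
print(next_word, round(dist[next_word], 3))
```

With a low `p_gen`, most of the probability mass comes from the attention over the input, so the decoder tends to pick a word from one of the two monolingual sentences, which is how a mixed-language output can emerge without hand-written switching rules.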
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Sigmoid Activation · Tanh Activation · Sequence to Sequence · Long Short-Term Memory
