Learn to Code-Switch: Data Augmentation using Copy Mechanism on Language Modeling

Genta Indra Winata; Andrea Madotto; Chien-Sheng Wu; Pascale Fung

arXiv:1810.10254 · cs.CL · October 31, 2018 · 21 cites

Open Access

TL;DR

This paper introduces a data augmentation method for code-switching language models: a Seq2Seq model combined with pointer networks generates code-switching sentences from parallel corpora, and training a language model on the augmented sentences improves perplexity by 10% over an LSTM baseline.

Contribution

The work presents a new technique combining Seq2Seq and pointer networks to generate code-switching data without relying on linguistic constraints, enhancing data diversity.

Findings

01

Perplexity score improved by 10% with augmented data

02

Method effectively generates grammatical code-switching sentences

03

Outperforms baseline LSTM language model
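
The perplexity finding above can be made concrete with a short calculation. The following is a minimal sketch, assuming the 10% figure is a relative reduction in perplexity against the baseline; the function names and the toy probabilities are hypothetical and are not taken from the paper.

```python
import math


def perplexity(token_log_probs):
    """Perplexity is exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def relative_improvement(baseline_ppl, new_ppl):
    """Fraction by which new_ppl is lower than baseline_ppl."""
    return (baseline_ppl - new_ppl) / baseline_ppl


# Illustrative numbers only (an assumption), not results reported in the paper.
baseline_ppl = perplexity([math.log(1 / 100)] * 4)   # model assigns 1/100 to each token
augmented_ppl = perplexity([math.log(1 / 90)] * 4)   # model assigns 1/90 to each token
gain = relative_improvement(baseline_ppl, augmented_ppl)
```

Lower perplexity means the model assigns higher probability to the held-out text, so a positive `gain` indicates an improvement over the baseline.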

Abstract

Building large-scale datasets for training code-switching language models is challenging and very expensive. Using parallel corpora has been a major workaround for this problem. However, existing solutions rely on linguistic constraints, which may not capture the real data distribution. In this work, we propose a novel method for learning how to generate code-switching sentences from parallel corpora. Our model uses a Seq2Seq model in combination with pointer networks to align and choose words from the monolingual sentences and form a grammatical code-switching sentence. In our experiment, we show that training a language model on the augmented sentences improves the perplexity score by 10% compared to the LSTM baseline.
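
The abstract describes a Seq2Seq model that uses pointer networks to choose words from the monolingual sentences. A minimal sketch of one decoding step of such a copy mechanism follows. It assumes a pointer-generator style mixture, in which a gate `p_gen` blends a generated vocabulary distribution with attention weights over the source tokens; the abstract does not state the exact formulation, and the function names, the toy sentence pair, and all scores below are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def copy_mixture(vocab_dist, source_tokens, attention_scores, p_gen):
    """Blend a generator distribution with a copy distribution over source tokens.

    final(w) = p_gen * vocab_dist(w) + (1 - p_gen) * sum of attention on positions holding w
    """
    if not 0.0 <= p_gen <= 1.0:
        raise ValueError("p_gen must lie in [0, 1]")
    if len(source_tokens) != len(attention_scores):
        raise ValueError("one attention score is needed per source token")
    attention = softmax(attention_scores)
    # Start from the scaled generator distribution.
    final = {word: p_gen * prob for word, prob in vocab_dist.items()}
    # Add copy probability mass; tokens outside the vocabulary become reachable.
    for token, weight in zip(source_tokens, attention):
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final


# Toy input (an assumption): an English sentence concatenated with its Mandarin translation.
source = ["i", "want", "eat", "我", "要", "吃"]
scores = [0.1, 0.2, 0.3, 0.1, 0.2, 2.0]          # unnormalized attention scores
vocab = {"i": 0.5, "want": 0.3, "eat": 0.2}      # generator distribution for this step
dist = copy_mixture(vocab, source, scores, p_gen=0.4)
best = max(dist, key=dist.get)
```

Because the copy term places probability directly on source positions, the decoder can emit a word from either language of the parallel pair at each step, which is what allows the output sentence to switch languages.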

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Sigmoid Activation · Tanh Activation · Sequence to Sequence · Long Short-Term Memory