Code-Switched Language Models Using Neural Based Synthetic Data from Parallel Sentences
Genta Indra Winata, Andrea Madotto, Chien-Sheng Wu, Pascale Fung

TL;DR
This paper introduces a neural sequence-to-sequence model with a copy mechanism that generates synthetic code-switching data from parallel monolingual sentences, improving language models and speech recognition without external linguistic tools.
Contribution
It presents a novel neural model that learns to generate realistic code-switching sentences using only parallel data, avoiding external alignment tools and capturing linguistic constraints.
Findings
A language model trained on the generated data achieves state-of-the-art performance.
The generated data also improves end-to-end speech recognition accuracy.
The model generates high-quality synthetic code-switching data.
Abstract
Training code-switched language models is difficult due to a lack of data and the complexity of the grammatical structure. Linguistic constraint theories have been used for decades to generate artificial code-switching sentences to cope with this issue. However, these theories require external word alignments or constituency parsers, which produce erroneous results on distant languages. We propose a sequence-to-sequence model using a copy mechanism to generate code-switching data by leveraging parallel monolingual translations from a limited source of code-switching data. The model learns how to combine words from parallel sentences and identifies when to switch from one language to the other. Moreover, it captures code-switching constraints by attending to and aligning the words in the inputs, without requiring any external knowledge. Based on experimental results, the language model trained with the generated…
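The copy mechanism mentioned in the abstract can be illustrated with a small sketch. The abstract does not spell out the exact formulation, so the code below assumes one common pointer-generator style of mixing: the final word distribution blends a vocabulary (generation) distribution with the attention weights over the source tokens, controlled by a scalar gate. The function name `mix_copy_distribution`, the gate `p_gen`, and the toy tokens and probabilities are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from collections import defaultdict


def mix_copy_distribution(p_gen, vocab_probs, attention, source_tokens):
    """Blend a generation distribution with a copy distribution.

    p_gen         -- assumed scalar gate in [0, 1]: weight given to generating
    vocab_probs   -- dict mapping word -> probability of generating it
    attention     -- attention weights over source positions (sum to 1)
    source_tokens -- tokens of the concatenated parallel input sentences
    """
    if len(attention) != len(source_tokens):
        raise ValueError("attention and source_tokens must have equal length")

    final = defaultdict(float)
    # Generation path: scale the vocabulary distribution by p_gen.
    for word, prob in vocab_probs.items():
        final[word] += p_gen * prob
    # Copy path: each source token receives its attention mass, scaled by
    # (1 - p_gen). Repeated tokens accumulate mass across positions.
    for token, weight in zip(source_tokens, attention):
        final[token] += (1.0 - p_gen) * weight
    return dict(final)


# Toy input: an English sentence followed by a parallel Mandarin sentence.
source = ["i", "want", "to", "eat", "我", "想", "吃饭"]
attention = [0.05, 0.05, 0.05, 0.05, 0.10, 0.10, 0.60]
vocab_probs = {"i": 0.4, "want": 0.3, "eat": 0.2, "<unk>": 0.1}

dist = mix_copy_distribution(0.3, vocab_probs, attention, source)
best = max(dist, key=dist.get)
print(best, round(dist[best], 2))
```

In this toy setting the gate favors copying, so the token with the most attention in the second-language half of the input receives the largest probability. This is one way a decoder could switch languages mid-sentence without an external word aligner, since the attention weights play the role of a soft alignment.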
