Investigating Lexical Replacements for Arabic-English Code-Switched Data Augmentation
Injy Hamed, Nizar Habash, Slim Abdennadher, Ngoc Thang Vu

TL;DR
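This paper explores lexical replacement techniques for augmenting dialectal Arabic-English code-switched data. It shows that choosing code-switching points with a predictive model produces more natural sentences than choosing them at random, and that augmentation improves machine translation, speech recognition, and speech translation.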
Contribution
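The paper performs lexical replacements over word-aligned parallel corpora, with code-switching (CS) points either chosen randomly or learnt by a sequence-to-sequence model, and compares both against dictionary-based replacements through human evaluation and on MT, ASR, and ST tasks.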
Findings
The predictive model yields more natural code-switched sentences than random selection of CS points, according to human judgements.
Both alignment-based approaches (random and predictive) outperform dictionary-based replacements.
Data augmentation improves MT, ASR, and ST performance significantly.
Abstract
Data sparsity is a main problem hindering the development of code-switching (CS) NLP systems. In this paper, we investigate data augmentation techniques for synthesizing dialectal Arabic-English CS text. We perform lexical replacements using word-aligned parallel corpora where CS points are either randomly chosen or learnt using a sequence-to-sequence model. We compare these approaches against dictionary-based replacements. We assess the quality of the generated sentences through human evaluation and evaluate the effectiveness of data augmentation on machine translation (MT), automatic speech recognition (ASR), and speech translation (ST) tasks. Results show that using a predictive model results in more natural CS sentences compared to the random approach, as reported in human judgements. In the downstream tasks, despite the random approach generating more data, both approaches perform…
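The random variant of the lexical replacement described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `random_lexical_replacement`, the `ratio` parameter, the restriction to one-to-one alignments, and the romanized example tokens are all assumptions made for illustration. The learnt variant would replace the random choice with positions predicted by a sequence-to-sequence model.

```python
import random
from collections import Counter


def random_lexical_replacement(src_tokens, tgt_tokens, alignments, ratio=0.3, seed=None):
    """Swap a random subset of source words for their aligned target words.

    src_tokens / tgt_tokens: tokenized parallel sentence pair.
    alignments: iterable of (source_index, target_index) pairs.
    ratio: fraction of eligible words to replace (hypothetical parameter).
    """
    rng = random.Random(seed)
    alignments = list(alignments)
    src_counts = Counter(s for s, _ in alignments)
    tgt_counts = Counter(t for _, t in alignments)

    # Assumption: only one-to-one alignments are eligible for replacement.
    one_to_one = {
        s: t for s, t in alignments
        if src_counts[s] == 1 and tgt_counts[t] == 1
    }

    candidates = sorted(one_to_one)
    k = min(len(candidates), max(0, round(ratio * len(candidates))))
    chosen = set(rng.sample(candidates, k))

    # Keep source word order; substitute only at the chosen positions.
    return [
        tgt_tokens[one_to_one[i]] if i in chosen else token
        for i, token in enumerate(src_tokens)
    ]


# Hypothetical romanized sentence pair with a monotone word alignment.
src = ["ana", "bahib", "el", "kitab"]
tgt = ["I", "love", "the", "book"]
align = [(0, 0), (1, 1), (2, 2), (3, 3)]
augmented = random_lexical_replacement(src, tgt, align, ratio=0.5, seed=0)
```

With `ratio=0.5`, two of the four aligned words are replaced by their English counterparts; which two depends on the seed.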
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
