Language-agnostic Code-Switching in Sequence-To-Sequence Speech Recognition
Enes Yavuz Ugan, Christian Huber, Juan Hussain, Alexander Waibel

TL;DR
This paper introduces a data augmentation technique for end-to-end speech recognition models that improves transcription accuracy of code-switching speech, especially in low-resource scenarios, by concatenating audio and labels from different languages.
Contribution
It proposes a simple concatenation-based data augmentation method to enhance multilingual E2E speech recognition models for code-switching scenarios.
Findings
Improves CS speech transcription accuracy
Surpasses monolingual models on monolingual tests
Enhances performance on unseen language switches by 5.03% WER
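The findings above are reported as word error rate (WER). For readers unfamiliar with the metric, the following is a minimal sketch of the standard word-level definition: edit distance between reference and hypothesis divided by the number of reference words. The function name `word_error_rate` is hypothetical, and the paper's own scoring setup (text normalization, tokenization) is not described in this summary. The summary also does not state whether the 5.03% figure is an absolute or a relative change.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length.

    Illustrative only: splits on whitespace and applies no text
    normalization, which real evaluation pipelines usually add.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,                         # deletion
                curr[j - 1] + 1,                     # insertion
                prev[j - 1] + (ref_word != hyp_word) # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, this value can exceed 1.0 when the hypothesis is much longer than the reference.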
Abstract
Code-switching (CS) refers to the phenomenon of alternately using words and phrases from different languages. While today's neural end-to-end (E2E) models deliver state-of-the-art performance on the task of automatic speech recognition (ASR), it is commonly known that these systems are very data-intensive. However, only a small amount of transcribed and aligned CS speech is available. To overcome this problem and train multilingual systems that can transcribe CS speech, we propose a simple yet effective data augmentation in which audio and corresponding labels of different source languages are concatenated. By using this training data, our E2E model improves on transcribing CS speech. It also surpasses monolingual models on monolingual tests. The results show that this augmentation technique can even improve the model's performance on inter-sentential language switches not seen during…
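The concatenation idea described in the abstract can be illustrated with a short sketch. This is not the authors' implementation. The `Utterance` type, the `concat_augment` function, the choice of exactly two utterances per sample, and uniform random sampling are all assumptions made for illustration. Audio is held as a plain list of samples to keep the example dependency-free; a real pipeline would more likely use arrays or feature tensors.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Utterance:
    audio: List[float]  # waveform samples (hypothetical representation)
    text: str           # transcript
    lang: str           # language tag, e.g. "de" or "en"


def concat_augment(
    pool: List[Utterance],
    n_samples: int,
    rng: Optional[random.Random] = None,
) -> List[Utterance]:
    """Build synthetic code-switching samples by joining two utterances
    from different languages, concatenating audio and labels in order."""
    rng = rng or random.Random(0)

    # Group the monolingual pool by language.
    by_lang: Dict[str, List[Utterance]] = {}
    for utt in pool:
        by_lang.setdefault(utt.lang, []).append(utt)
    langs = sorted(by_lang)
    if len(langs) < 2:
        raise ValueError("need utterances from at least two languages")

    augmented = []
    for _ in range(n_samples):
        # Pick two distinct languages, then one utterance from each.
        lang_a, lang_b = rng.sample(langs, 2)
        first = rng.choice(by_lang[lang_a])
        second = rng.choice(by_lang[lang_b])
        augmented.append(
            Utterance(
                audio=first.audio + second.audio,      # audio concatenation
                text=f"{first.text} {second.text}",    # label concatenation
                lang=f"{lang_a}+{lang_b}",
            )
        )
    return augmented
```

Each synthetic sample contains a single language switch at the utterance boundary, which corresponds to the inter-sentential switching the abstract mentions.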
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing
