Semi-supervised acoustic and language model training for English-isiZulu code-switched speech recognition
A. Biswas, F. de Wet, E. van der Westhuizen, T.R. Niesler

TL;DR
This paper explores semi-supervised training for English-isiZulu code-switched speech recognition, showing that incorporating automatically transcribed data improves acoustic models and reduces word error rate, despite limited impact on language modeling.
Contribution
It demonstrates the effectiveness of semi-supervised acoustic model training with CNN-TDNN-F architectures for code-switched speech recognition, using automatically transcribed multilingual data.
Findings
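Adding automatically transcribed (semi-supervised) data improves TDNN-F acoustic model performance.
Adding CNN layers to the TDNN-F acoustic model improves performance further.
Mixed-language word error rate (WER) fell by 5.6% absolute over two iterations of semi-supervised training: 3.4% after the first iteration and a further 2.2% after the second.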
Abstract
We present an analysis of semi-supervised acoustic and language model training for English-isiZulu code-switched ASR using soap opera speech. Approximately 11 hours of untranscribed multilingual speech was transcribed automatically using four bilingual code-switching transcription systems operating in English-isiZulu, English-isiXhosa, English-Setswana and English-Sesotho. These transcriptions were incorporated into the acoustic and language model training sets. Results showed that the TDNN-F acoustic models benefit from the additional semi-supervised data and that even better performance could be achieved by including additional CNN layers. Using these CNN-TDNN-F acoustic models, a first iteration of semi-supervised training achieved an absolute mixed-language WER reduction of 3.4%, and a further 2.2% after a second iteration. Although the languages in the untranscribed data were…
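The WER figures quoted above are absolute differences, that is, percentage points rather than relative improvements. As a rough illustration only, the sketch below computes WER as word-level edit distance divided by the number of reference words, and adds up the two reductions reported in the abstract. The function name `word_error_rate` and the example sentences are hypothetical and are not taken from the paper; the paper's own scoring pipeline is not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + substitution_cost,   # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical code-switched utterance (not from the paper's data):
# one of four reference words is missing from the hypothesis.
example_wer = word_error_rate("ngiyabonga for the help", "ngiyabonga for help")

# Absolute reductions reported in the abstract, in percentage points.
first_iteration_gain = 3.4
second_iteration_gain = 2.2
total_gain = round(first_iteration_gain + second_iteration_gain, 1)
```

Because the reductions are absolute, they add directly, which gives the 5.6 percentage point total over the two iterations. Converting this to a relative improvement would require the baseline WER, which is not given in this summary.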
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
