Semi-supervised acoustic and language model training for English-isiZulu code-switched speech recognition

A. Biswas; F. de Wet; E. van der Westhuizen; T.R. Niesler

arXiv:2004.04054 · eess.AS · April 9, 2020 · 1 citation

Open Access

TL;DR

This paper explores semi-supervised training for English-isiZulu code-switched speech recognition, showing that incorporating automatically transcribed data improves acoustic models and reduces word error rate, despite limited impact on language modeling.

Contribution

It demonstrates the effectiveness of semi-supervised acoustic model training with CNN-TDNN-F architectures for code-switched speech recognition, using automatically transcribed multilingual data.

Findings

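01. Automatically transcribed (semi-supervised) data improves TDNN-F acoustic model performance.

02. Adding CNN layers to the TDNN-F acoustic model improves performance further.

03. Mixed-language word error rate fell by 5.6% absolute over two iterations of semi-supervised training (3.4% after the first, a further 2.2% after the second).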

Abstract

We present an analysis of semi-supervised acoustic and language model training for English-isiZulu code-switched ASR using soap opera speech. Approximately 11 hours of untranscribed multilingual speech was transcribed automatically using four bilingual code-switching transcription systems operating in English-isiZulu, English-isiXhosa, English-Setswana and English-Sesotho. These transcriptions were incorporated into the acoustic and language model training sets. Results showed that the TDNN-F acoustic models benefit from the additional semi-supervised data and that even better performance could be achieved by including additional CNN layers. Using these CNN-TDNN-F acoustic models, a first iteration of semi-supervised training achieved an absolute mixed-language WER reduction of 3.4%, and a further 2.2% after a second iteration. Although the languages in the untranscribed data were…
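The iterative procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `Decoder` type, the `train` callback, and all function names are hypothetical. The abstract also does not say how the outputs of the four bilingual systems were combined, so the sketch assumes, purely for illustration, that the highest-confidence hypothesis per utterance is kept.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical decoder interface: utterance ID -> (transcript, confidence).
Decoder = Callable[[str], Tuple[str, float]]


@dataclass
class PseudoLabel:
    utterance_id: str
    system: str        # name of the bilingual system that produced the transcript
    transcript: str
    confidence: float


def pseudo_label(utterances: List[str], systems: Dict[str, Decoder]) -> List[PseudoLabel]:
    """Decode each utterance with every system and keep one hypothesis.

    ASSUMPTION: keeping the highest-confidence hypothesis is an illustrative
    choice; the abstract does not state the selection rule.
    """
    labels = []
    for utt in utterances:
        best = None
        for name, decode in systems.items():
            text, conf = decode(utt)
            if best is None or conf > best.confidence:
                best = PseudoLabel(utt, name, text, conf)
        if best is not None:
            labels.append(best)
    return labels


def semi_supervised_rounds(transcribed, untranscribed, train, iterations=2):
    """Alternate between automatic transcription and retraining.

    `train` is a hypothetical callback that takes a list of
    (utterance_id, transcript) pairs and returns a dict of decoders.
    """
    systems = train(list(transcribed))  # baseline trained on manual transcripts only
    for _ in range(iterations):
        labels = pseudo_label(untranscribed, systems)
        augmented = list(transcribed) + [(l.utterance_id, l.transcript) for l in labels]
        systems = train(augmented)      # retrain on manual + automatic transcripts
    return systems
```

With `iterations=2` the loop mirrors the two rounds reported in the abstract: the retrained systems from the first round re-transcribe the same untranscribed pool before the second retraining.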

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research