# Semi-supervised acoustic model training for five-lingual code-switched ASR

**Authors:** Astik Biswas, Emre Yılmaz, Febe de Wet, Ewald van der Westhuizen, Thomas Niesler

arXiv: 1906.08647 · 2019-10-16

## TL;DR

This paper applies semi-supervised training to acoustic models for code-switched speech in five South African languages (English, isiZulu, isiXhosa, Setswana and Sesotho), comparing four separate bilingual models with a single unified five-lingual model.

## Contribution

It introduces a semi-supervised training approach for both bilingual and five-lingual acoustic models in under-resourced code-switched speech recognition.

## Key findings

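- Semi-supervised training with approximately 11 hours of untranscribed speech improves both the bilingual and the five-lingual acoustic models.
- Adding CNN layers (CNN-TDNN-F) benefits the bilingual TDNN-F models, but gives no significant improvement for the five-lingual system.
- English is common to all language pairs, so it dominates the unified language model, improving English ASR at the expense of the other four languages.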

## Abstract

This paper presents recent progress in the acoustic modelling of under-resourced code-switched (CS) speech in multiple South African languages. We consider two approaches. The first constructs separate bilingual acoustic models corresponding to language pairs (English-isiZulu, English-isiXhosa, English-Setswana and English-Sesotho). The second constructs a single unified five-lingual acoustic model representing all the languages (English, isiZulu, isiXhosa, Setswana and Sesotho). For these two approaches we consider the effectiveness of semi-supervised training to increase the size of the very sparse acoustic training sets. Using approximately 11 hours of untranscribed speech, we show that both approaches benefit from semi-supervised training. The bilingual TDNN-F acoustic models also benefit from the addition of CNN layers (CNN-TDNN-F), while the five-lingual system does not show any significant improvement. Furthermore, because English is common to all language pairs in our data, it dominates when training a unified language model, leading to improved English ASR performance at the expense of the other languages. Nevertheless, the five-lingual model offers flexibility because it can process more than two languages simultaneously, and is therefore an attractive option as an automatic transcription system in a semi-supervised training pipeline.
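The semi-supervised step described in the abstract can be pictured as pooling manually transcribed data with automatically transcribed data before retraining. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the `Utterance` type, the function names, and `toy_transcriber` (a stand-in for decoding with a seed acoustic model) are all hypothetical, and the abstract does not say whether any filtering or selection of automatic transcripts is applied.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Utterance:
    """Hypothetical container for one utterance in a training pool."""
    utt_id: str
    audio_path: str
    transcript: Optional[str] = None  # None means untranscribed
    is_automatic: bool = False        # True if the transcript came from ASR


def pseudo_label(untranscribed: List[Utterance],
                 transcribe: Callable[[str], str]) -> List[Utterance]:
    """Attach automatic transcripts produced by a seed ASR system."""
    return [
        Utterance(u.utt_id, u.audio_path, transcribe(u.audio_path), True)
        for u in untranscribed
    ]


def build_training_set(manual: List[Utterance],
                       untranscribed: List[Utterance],
                       transcribe: Callable[[str], str]) -> List[Utterance]:
    """Pool manual and automatically transcribed data for retraining."""
    return list(manual) + pseudo_label(untranscribed, transcribe)


def toy_transcriber(audio_path: str) -> str:
    # Assumption: placeholder for a real decoder; it only echoes the path.
    return f"<hypothesis for {audio_path}>"
```

In this sketch the `transcribe` callable is the only part that depends on the choice of system, so either a bilingual model or the five-lingual model could be passed in as the automatic transcriber; the abstract singles out the five-lingual model for this role because it can process more than two languages at once.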

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1906.08647/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/1906.08647/full.md

## References

26 references — full list in the complete paper: https://tomesphere.com/paper/1906.08647/full.md

---
Source: https://tomesphere.com/paper/1906.08647