Multilingual training set selection for ASR in under-resourced Malian languages
Ewald van der Westhuizen, Trideba Padhi, Thomas Niesler

TL;DR
This paper investigates optimal multilingual training data selection for speech recognition in severely under-resourced Malian languages, demonstrating that judicious choice of additional languages improves performance more than simply increasing data volume.
Contribution
The paper examines how to select the most beneficial out-of-language data for multilingual ASR training in under-resourced languages, challenging the assumption that more data always yields better results.
Findings
Adding only one carefully chosen language improves recognition accuracy.
Selective data inclusion outperforms pooling all available languages.
Targeted multilingual training reduces word error rate significantly.
Abstract
We present first speech recognition systems for the two severely under-resourced Malian languages Bambara and Maasina Fulfulde. These systems will be used by the United Nations as part of a monitoring system to inform and support humanitarian programmes in rural Africa. We have compiled datasets in Bambara and Maasina Fulfulde, but since these are very small, we take advantage of six similarly under-resourced datasets in other languages for multilingual training. We focus specifically on the best composition of the multilingual pool of speech data for multilingual training. We find that, although maximising the training pool by including all six additional languages provides improved speech recognition in both target languages, substantially better performance can be achieved by a more judicious choice. Our experiments show that the addition of just one language provides best…
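The pool-composition question described in the abstract can be framed as a search over subsets of the additional languages. The sketch below is a minimal illustration of that framing, not the authors' procedure. The function `select_training_pool`, the `dev_wer` callback, the placeholder names `lang_a` to `lang_c`, and every number in `toy_dev_wer` are assumptions made up for the example. In practice `dev_wer` would train a system on the given pool and score it on a development set.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Optional, Sequence, Tuple


def select_training_pool(
    target: str,
    candidates: Sequence[str],
    dev_wer: Callable[[FrozenSet[str]], float],
    max_extra: Optional[int] = None,
) -> Tuple[FrozenSet[str], float]:
    """Exhaustively search candidate pools and return the one with lowest WER.

    Every pool contains the target language; `dev_wer` is a hypothetical
    callback that returns the development-set word error rate for a pool.
    """
    limit = len(candidates) if max_extra is None else max_extra
    best_pool = frozenset([target])      # baseline: target data only
    best_wer = dev_wer(best_pool)
    for k in range(1, limit + 1):        # try 1, 2, ... extra languages
        for extras in combinations(candidates, k):
            pool = frozenset((target, *extras))
            wer = dev_wer(pool)
            if wer < best_wer:           # strict: ties keep the smaller pool
                best_pool, best_wer = pool, wer
    return best_pool, best_wer


# Toy stand-in for training and evaluation. The numbers are invented
# placeholders, NOT results from the paper.
_TOY_EFFECT = {"lang_a": -5.0, "lang_b": 2.0, "lang_c": 1.0}


def toy_dev_wer(pool: FrozenSet[str]) -> float:
    return 60.0 + sum(_TOY_EFFECT.get(lang, 0.0) for lang in pool)


best_pool, best_wer = select_training_pool(
    "bambara", ["lang_a", "lang_b", "lang_c"], toy_dev_wer
)
print(sorted(best_pool), best_wer)
```

An exhaustive search trains one system per subset, so with six candidate languages it would need 64 training runs per target language. Setting `max_extra` to a small value limits the search to small pools.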
