Automatic Speech Recognition for Humanitarian Applications in Somali
Raghav Menon, Astik Biswas, Armin Saeb, John Quinn, Thomas Niesler

TL;DR
This paper develops an initial Somali speech recognition system from only 1.57 hours of annotated speech, using neural acoustic models and data augmentation. The system reaches a 53.75% word error rate and is intended to support keyword spotting for humanitarian applications.
Contribution
It presents the authors' first Somali speech recognition system, built with neural acoustic models and data augmentation and aimed at severely under-resourced languages in humanitarian contexts.
Findings
Both language model data augmentation and acoustic data perturbation improve performance
Neural architectures outperform traditional models
Best system, combining CNN, TDNN, and BLSTM, achieves 53.75% word error rate (WER)
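For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the authors' scoring pipeline; the function name `word_error_rate` and the English example sentences are assumptions made for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: whitespace tokenization, no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of four reference words.
print(word_error_rate("the quick brown fox", "the quick fox"))  # 0.25
```

On this definition, a 53.75% WER means roughly one word-level error for every two reference words, which is why the system is framed as an initial effort aimed at keyword spotting rather than full transcription.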
Abstract
We present our first efforts in building an automatic speech recognition system for Somali, an under-resourced language, using 1.57 hrs of annotated speech for acoustic model training. The system is part of an ongoing effort by the United Nations (UN) to implement keyword spotting systems supporting humanitarian relief programmes in parts of Africa where languages are severely under-resourced. We evaluate several types of acoustic model, including recent neural architectures. Language model data augmentation using a combination of recurrent neural networks (RNN) and long short-term memory neural networks (LSTMs) as well as the perturbation of acoustic data are also considered. We find that both types of data augmentation are beneficial to performance, with our best system using a combination of convolutional neural networks (CNNs), time-delay neural networks (TDNNs) and bi-directional…
