KinSPEAK: Improving speech recognition for Kinyarwanda via semi-supervised learning methods

Antoine Nzeyimana

arXiv:2308.11863·eess.AS·March 5, 2024

Open Access

TL;DR

This paper introduces KinSPEAK, a semi-supervised learning approach to Kinyarwanda speech recognition that combines self-supervised pre-training, a simple curriculum schedule during fine-tuning, and learning from large unlabelled speech data. The final model reaches 3.2% WER on a new studio-quality dataset and 15.6% WER on Mozilla Common Voice.

Contribution

The paper presents a semi-supervised learning framework for Kinyarwanda speech recognition that uses only public-domain data and combines curriculum scheduling with syllabic tokenization to improve performance.

Findings

01. Achieved 3.2% word error rate (WER) on the new studio-quality dataset

02. Achieved 15.6% WER on Mozilla Common Voice

03. Syllabic tokenization outperforms character-based tokenization
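
The findings above are reported as word error rate (WER). As a reminder of what that number measures, here is a minimal sketch of the standard definition: the word-level edit distance between a reference transcript and a model hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace-based word splitting are assumptions made for illustration; the paper's own scoring script and text normalization are not shown on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper, not the paper's evaluation code. Words are
    obtained by splitting on whitespace (an assumption).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a b x d"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.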

Abstract

Despite recent availability of large transcribed Kinyarwanda speech data, achieving robust speech recognition for Kinyarwanda is still challenging. In this work, we show that using self-supervised pre-training, following a simple curriculum schedule during fine-tuning and using semi-supervised learning to leverage large unlabelled speech data significantly improve speech recognition performance for Kinyarwanda. Our approach focuses on using public domain data only. A new studio-quality speech dataset is collected from a public website, then used to train a clean baseline model. The clean baseline model is then used to rank examples from a more diverse and noisy public dataset, defining a simple curriculum training schedule. Finally, we apply semi-supervised learning to label and learn from large unlabelled data in five successive generations. Our final model achieves 3.2% word error…
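
The pipeline described in the abstract (rank noisy examples with a clean baseline, train on a curriculum, then repeatedly label and learn from unlabelled audio) can be sketched roughly as follows. This is a hedged outline, not the author's implementation: the names `Example`, `rank_by_difficulty`, `curriculum_stages`, and `self_train` are hypothetical, the ranking criterion (a user-supplied disagreement score between the baseline's output and the transcript) is an assumption, and so is the choice of cumulative easy-to-hard stages. Model training and transcription are passed in as callables because the page does not specify them.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass
class Example:
    audio_id: str    # hypothetical identifier for an audio clip
    transcript: str  # human transcript from the noisy public dataset


def rank_by_difficulty(
    examples: Sequence[Example],
    transcribe: Callable[[str], str],
    score: Callable[[str, str], float],
) -> List[Example]:
    """Sort examples from easiest to hardest for the clean baseline.

    Difficulty is assumed to be score(transcript, baseline_output),
    where lower means the baseline already agrees with the transcript.
    """
    return sorted(examples, key=lambda ex: score(ex.transcript, transcribe(ex.audio_id)))


def curriculum_stages(ranked: Sequence[Example], num_stages: int) -> List[List[Example]]:
    """Build cumulative stages: each stage adds harder examples to the last."""
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    n = len(ranked)
    stages = []
    for k in range(1, num_stages + 1):
        cutoff = (n * k + num_stages - 1) // num_stages  # ceil(n * k / num_stages)
        stages.append(list(ranked[:cutoff]))
    return stages


def self_train(
    model: Any,
    labelled: List[Tuple[str, str]],
    unlabelled: Sequence[str],
    train: Callable[[List[Tuple[str, str]]], Any],
    transcribe: Callable[[Any, str], str],
    generations: int = 5,  # the abstract mentions five successive generations
) -> Any:
    """Generic self-training loop: pseudo-label, retrain, repeat."""
    for _ in range(generations):
        pseudo = [(audio, transcribe(model, audio)) for audio in unlabelled]
        model = train(labelled + pseudo)
    return model
```

In practice a pseudo-labelling loop often filters low-confidence transcripts before retraining; that step is omitted here because the page does not say whether or how the paper does it.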

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing