Transcribe, Align and Segment: Creating speech datasets for low-resource languages
Taras Sereda

TL;DR
This paper presents a cost-effective approach for creating speech datasets for low-resource languages by transcribing, aligning, and segmenting unlabeled speech, demonstrated with Ukrainian podcasts and a new ASR model.
Contribution
The work introduces a novel method for generating speech datasets for low-resource languages and releases a new Ukrainian speech dataset and ASR model.
Findings
The UK-PODS dataset contains over 50 hours of Ukrainian speech data.
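The uk-pods-conformer model achieves a 3x reduction in Word Error Rate (WER) on podcasts compared to the publicly available uk-nvidia-citrinet, while maintaining comparable WER on the MCV-10 test split.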
The approach is effective for low-resource language speech dataset creation.
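For readers unfamiliar with the metric behind the second finding, Word Error Rate is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration only; the function name `word_error_rate` is hypothetical, and the whitespace tokenization is an assumption that ignores any text normalization the paper may apply.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a "3x reduction" means the baseline model's WER divided by the new model's WER is roughly 3 on the same test set.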
Abstract
In this work, we showcase a cost-effective method for generating training data for speech processing tasks. First, we transcribe unlabeled speech using a state-of-the-art Automatic Speech Recognition (ASR) model. Next, we align the generated transcripts with the audio and segment the result into short utterances. Our focus is on ASR for low-resource languages, such as Ukrainian, using podcasts as a source of unlabeled speech. We release a new dataset, UK-PODS, that features modern conversational Ukrainian. It contains over 50 hours of text-audio pairs. We also release uk-pods-conformer, a 121M-parameter ASR model trained on MCV-10 and UK-PODS that achieves a 3x reduction in Word Error Rate (WER) on podcasts compared to the publicly available uk-nvidia-citrinet while maintaining comparable WER on the MCV-10 test split. Both dataset UK-PODS…
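The segmentation step can be pictured with a small sketch. It assumes the alignment step has already produced word-level timestamps, and it greedily groups words into utterances, cutting at long pauses or when an utterance would exceed a maximum duration. This is not the paper's implementation: the `Word` and `Segment` types, the `segment_words` function, and the `max_duration` and `min_pause` defaults are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # seconds, assumed to come from a forced aligner
    end: float


@dataclass
class Segment:
    text: str
    start: float
    end: float


def _close(words: List[Word]) -> Segment:
    """Merge a run of aligned words into one text-audio segment."""
    return Segment(" ".join(w.text for w in words), words[0].start, words[-1].end)


def segment_words(
    words: List[Word],
    max_duration: float = 15.0,  # hypothetical cap on utterance length
    min_pause: float = 0.5,      # hypothetical silence threshold for a cut
) -> List[Segment]:
    """Greedily split aligned words into short utterances."""
    segments: List[Segment] = []
    current: List[Word] = []
    for word in words:
        if current:
            pause = word.start - current[-1].end
            too_long = word.end - current[0].start > max_duration
            if pause >= min_pause or too_long:
                segments.append(_close(current))
                current = []
        current.append(word)
    if current:
        segments.append(_close(current))
    return segments
```

Each resulting `Segment` carries the start and end times needed to cut the matching audio clip, which yields the kind of text-audio pairs the abstract describes.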
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: ALIGN · Focus
