Transcribe, Align and Segment: Creating speech datasets for low-resource languages

Taras Sereda

arXiv:2406.12674·eess.AS·June 19, 2024

TL;DR

This paper presents a cost-effective approach for creating speech datasets for low-resource languages by transcribing, aligning, and segmenting unlabeled speech, demonstrated with Ukrainian podcasts and a new ASR model.

Contribution

The work introduces a novel method for generating speech datasets for low-resource languages and releases a new Ukrainian speech dataset and ASR model.

Findings

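01. The UK-PODS dataset contains over 50 hours of Ukrainian text-audio pairs drawn from podcasts.

02. The uk-pods-conformer model achieves a 3x reduction in Word Error Rate (WER) on podcasts compared to uk-nvidia-citrinet, with comparable WER on the MCV-10 test split.

03. Transcribing, aligning, and segmenting unlabeled podcast audio yields usable ASR training data for a low-resource language.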
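As background for the WER figure, here is a minimal sketch of how word error rate is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the paper's actual evaluation tooling and text normalization are not described on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenizes on whitespace and applies no text
    normalization (casing, punctuation), which real evaluations usually do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a "3x reduction" means the new model's WER on the same test set is roughly one third of the baseline's.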

Abstract

In this work, we showcase a cost-effective method for generating training data for speech processing tasks. First, we transcribe unlabeled speech using a state-of-the-art Automatic Speech Recognition (ASR) model. Next, we align the generated transcripts with the audio and segment them into short utterances. Our focus is on ASR for low-resource languages, such as Ukrainian, using podcasts as a source of unlabeled speech. We release a new dataset, UK-PODS, that features modern conversational Ukrainian and contains over 50 hours of text-audio pairs. We also release uk-pods-conformer, a 121M-parameter ASR model trained on MCV-10 and UK-PODS that achieves a 3x reduction in Word Error Rate (WER) on podcasts compared to the publicly available uk-nvidia-citrinet while maintaining comparable WER on the MCV-10 test split. Both dataset UK-PODS…
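The segmentation step can be sketched as follows, assuming the alignment stage has already produced word-level timestamps. This is not the paper's implementation: the `Word` and `Segment` types, the `segment_words` function, and the `max_duration` and `max_pause` thresholds are all hypothetical, chosen only to show one plausible way of cutting aligned words into short utterances.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Segment:
    text: str
    start: float
    end: float


def _close(words: List[Word]) -> Segment:
    """Merge consecutive aligned words into one utterance."""
    return Segment(
        text=" ".join(w.text for w in words),
        start=words[0].start,
        end=words[-1].end,
    )


def segment_words(
    words: List[Word],
    max_duration: float = 15.0,  # hypothetical cap on utterance length
    max_pause: float = 0.5,      # hypothetical silence that forces a cut
) -> List[Segment]:
    """Greedily group time-ordered words into short utterances.

    A new segment starts when the pause before a word exceeds max_pause,
    or when adding the word would push the segment past max_duration.
    """
    segments: List[Segment] = []
    current: List[Word] = []
    for word in words:
        if current:
            pause = word.start - current[-1].end
            too_long = word.end - current[0].start > max_duration
            if pause > max_pause or too_long:
                segments.append(_close(current))
                current = []
        current.append(word)
    if current:
        segments.append(_close(current))
    return segments
```

Each resulting segment carries a start time, an end time, and its text, which is enough to cut the source audio into text-audio pairs.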

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: ALIGN · Focus