FeruzaSpeech: A 60 Hour Uzbek Read Speech Corpus with Punctuation, Casing, and Context

Anna Povey; Katherine Povey

arXiv:2410.00035·eess.AS·October 2, 2024


TL;DR

FeruzaSpeech is a 60-hour Uzbek read speech corpus with transcripts in both Cyrillic and Latin scripts, intended to improve speech recognition systems and freely available for academic research.

Contribution

The paper introduces FeruzaSpeech, a new high-quality Uzbek speech corpus with dual-script transcripts and content drawn from a book and BBC News, and reports improved speech recognition performance when the corpus is integrated.

Findings

01 Improved Word Error Rates on Uzbek speech datasets

02 High-quality recordings from a native speaker

03 Availability of the corpus for academic research

Abstract

This paper introduces FeruzaSpeech, a read speech corpus of the Uzbek language, containing transcripts in both Cyrillic and Latin alphabets, freely available for academic research purposes. The corpus includes 60 hours of high-quality recordings from a single native female speaker from Tashkent, Uzbekistan. The recordings consist of short excerpts from a book and BBC News. The paper discusses the improvement in Word Error Rates (WERs) on CommonVoice 16.1's Uzbek data, Uzbek Speech Corpus data, and FeruzaSpeech data upon integrating FeruzaSpeech.
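For readers unfamiliar with the metric, the following is a minimal sketch of the standard Word Error Rate definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration. The abstract does not describe the paper's scoring setup or text normalization, so this should not be read as the authors' evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance / reference length.

    Tokenization by whitespace is an assumption for this sketch; no
    casing or punctuation normalization is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against four reference words.
    print(word_error_rate("a b c d", "a x c"))
```

Because no normalization is applied here, a casing difference alone counts as a substitution, which is one reason transcripts with punctuation and casing need a clearly stated scoring convention.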


Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis