Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context

Wei Kang; Xiaoyu Yang; Zengwei Yao; Fangjun Kuang; Yifan Yang; Liyong Guo; Long Lin; Daniel Povey

arXiv:2309.08105·eess.AS·January 17, 2024·1 cites

TL;DR

Libriheavy is a large-scale, freely-available ASR corpus with 50,000 hours of English speech, enriched with punctuation, casing, and context information, enabling more flexible speech system development.

Contribution

The paper introduces Libriheavy, which the authors describe as, to the best of their knowledge, the largest freely available speech corpus with supervisions, together with a general pipeline for locating, aligning, and segmenting audio against its source text.

Findings

1. Libriheavy contains 50,000 hours of speech with rich annotations.

2. Baseline ASR models achieve competitive performance on the dataset.

3. Open-source pipeline facilitates dataset creation for other tasks.

Abstract

In this paper, we introduce Libriheavy, a large-scale ASR corpus consisting of 50,000 hours of read English speech derived from LibriVox. To the best of our knowledge, Libriheavy is the largest freely available corpus of speech with supervisions. Unlike other open-source datasets that provide only normalized transcriptions, Libriheavy contains richer information such as punctuation, casing, and text context, which brings more flexibility for system building. Specifically, we propose a general and efficient pipeline to locate, align, and segment the audio in the previously published Librilight against its corresponding texts. Like Librilight, Libriheavy has three training subsets (small, medium, and large) of 500 h, 5,000 h, and 50,000 h respectively. We also extract the dev and test evaluation sets from the aligned audio and guarantee there are no overlapping speakers and books in…
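One way to read the flexibility claim in the abstract: a transcript that keeps punctuation and casing can always be reduced to a normalized form, while the reverse is not possible. The sketch below illustrates that direction only. The function name `normalize_transcript`, the choice to upper-case, and the choice to keep apostrophes are assumptions made for illustration; they are not the normalization rules used by the Libriheavy authors, and non-ASCII punctuation such as curly quotes is not handled.

```python
import string


def normalize_transcript(text: str) -> str:
    """Reduce a cased, punctuated transcript to upper-case words only.

    Hypothetical helper for illustration; not taken from the Libriheavy code.
    """
    # Assumption: keep apostrophes so contractions such as "don't" survive.
    drop = string.punctuation.replace("'", "")
    # Replace every other ASCII punctuation mark with a space.
    cleaned = text.translate(str.maketrans(drop, " " * len(drop)))
    # Upper-case and collapse runs of whitespace into single spaces.
    return " ".join(cleaned.upper().split())


rich = 'He said, "Wait here." Then he left.'
print(normalize_transcript(rich))
```

A system that needs only normalized text can apply a step like this at load time, whereas a system that predicts punctuation and casing can train on the original text directly.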

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: ALIGN