MSR-86K: An Evolving, Multilingual Corpus with 86,300 Hours of Transcribed Audio for Speech Recognition Research

Song Li; Yongbin You; Xuezhi Wang; Zhengkun Tian; Ke Ding; Guanglu Wan

arXiv:2406.18301 · eess.AS · June 27, 2024

TL;DR

This paper presents MSR-86K, a large-scale, publicly available multilingual speech corpus of 86,300 hours of transcribed audio in 15 languages, drawn from publicly accessible YouTube videos and intended to advance multilingual automatic speech recognition (ASR) research.

Contribution

The paper introduces MSR-86K, a new extensive multilingual speech dataset derived from YouTube videos, and demonstrates its use in training competitive multilingual ASR models.

Findings

01. MSR-86K covers 15 languages with 86,300 hours of transcribed audio.

02. A multilingual ASR model trained on MSR-86K and other open-source corpora is competitive with Whisper.

03. The corpus will be publicly released on HuggingFace for research use.

Abstract

Recently, multilingual artificial intelligence assistants, exemplified by ChatGPT, have gained immense popularity. As a crucial gateway to human-computer interaction, multilingual automatic speech recognition (ASR) has also garnered significant attention, as evidenced by systems like Whisper. However, the proprietary nature of the training data has impeded researchers' efforts to study multilingual ASR. This paper introduces MSR-86K, an evolving, large-scale multilingual corpus for speech recognition research. The corpus is derived from publicly accessible videos on YouTube, comprising 15 languages and a total of 86,300 hours of transcribed ASR data. We also introduce how to use the MSR-86K corpus and other open-source corpora to train a robust multilingual ASR model that is competitive with Whisper. MSR-86K will be publicly released on HuggingFace, and we believe that such a large…

Code & Models

Datasets

Alex-Song/MSR-86K
dataset · 52 downloads

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing