YODAS: Youtube-Oriented Dataset for Audio and Speech

Xinjian Li; Shinnosuke Takamichi; Takaaki Saeki; William Chen; Sayaka Shiota; Shinji Watanabe

arXiv:2406.00899 · cs.CL · June 4, 2024

TL;DR

YODAS is a large-scale, multilingual YouTube speech dataset with over 500,000 hours of data across 100+ languages, supporting supervised and self-supervised speech recognition research.

Contribution

This paper introduces YODAS, which the authors describe as the first publicly available dataset of its scale, and presents its collection methodology along with baseline speech recognition results.

Findings

01

YODAS contains over 500,000 hours of speech data in 100+ languages.

02

The dataset supports both supervised and self-supervised learning approaches.

03

The paper reports baseline speech recognition results for the top-15 languages.

Abstract

In this study, we introduce YODAS (YouTube-Oriented Dataset for Audio and Speech), a large-scale, multilingual dataset currently comprising over 500k hours of speech data in more than 100 languages, sourced from both labeled and unlabeled YouTube speech datasets. The labeled subsets, which include manual or automatic subtitles, facilitate supervised model training. Conversely, the unlabeled subsets are suited to self-supervised learning applications. YODAS is distinctive as the first publicly available dataset of its scale, and it is distributed under a Creative Commons license. We introduce the collection methodology used for YODAS, which contributes to large-scale speech dataset construction. Subsequently, we provide a comprehensive analysis of the speech and text contained within the dataset. Finally, we describe the speech recognition baselines over the top-15 languages.
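
The split between labeled and unlabeled subsets described in the abstract can be illustrated with a minimal Python sketch. It assumes a hypothetical `Utterance` record. The field names, the `split_by_supervision` and `hours_per_language` helpers, and the sample values are illustrative assumptions, not the actual YODAS schema or tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Utterance:
    """Hypothetical record layout; field names are assumptions, not the YODAS schema."""
    utt_id: str
    language: str
    duration_s: float
    subtitle: Optional[str] = None        # transcript text, if any
    subtitle_kind: Optional[str] = None   # "manual", "automatic", or None


def split_by_supervision(utterances):
    """Route subtitled utterances to supervised training, the rest to self-supervised."""
    supervised, self_supervised = [], []
    for utt in utterances:
        if utt.subtitle and utt.subtitle_kind in ("manual", "automatic"):
            supervised.append(utt)
        else:
            self_supervised.append(utt)
    return supervised, self_supervised


def hours_per_language(utterances):
    """Sum audio duration per language, converted from seconds to hours."""
    totals = {}
    for utt in utterances:
        totals[utt.language] = totals.get(utt.language, 0.0) + utt.duration_s / 3600.0
    return totals


# Made-up sample records, for illustration only.
sample = [
    Utterance("a", "en", 1800.0, "hello world", "manual"),
    Utterance("b", "en", 1800.0, "good morning", "automatic"),
    Utterance("c", "ja", 3600.0),
]
labeled, unlabeled = split_by_supervision(sample)
```

In this sketch, utterances with either manual or automatic subtitles land in the supervised pool. A real pipeline might weight or filter automatic subtitles differently, since their quality can vary.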

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization