GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement

Yifan Yang; Zheshu Song; Jianheng Zhuo; Mingyu Cui; Jinpeng Li; Bo Yang; Yexing Du; Ziyang Ma; Xunying Liu; Ziyuan Wang; Ke Li; Shuai Fan; Kai Yu; Wei-Qiang Zhang; Guoguo Chen; Xie Chen

arXiv:2406.11546·eess.AS·May 28, 2025·2 cites

TL;DR

GigaSpeech 2 is a large-scale, multi-domain multilingual speech corpus designed for low-resource languages, created through automated crawling, transcription, and refinement, enabling improved ASR performance with minimal labeled data.

Contribution

The paper introduces GigaSpeech 2, a novel large-scale multilingual speech dataset with an automated pipeline for data collection and refinement, specifically aiding low-resource language ASR development.

Findings

01

ASR models trained on GigaSpeech 2 reduce word error rate by 25-40% for Thai, Indonesian, and Vietnamese.

02

ASR models trained with the corpus and pipeline outperform commercial ASR services on challenging test sets.

03

Automated data collection and refinement significantly enhance low-resource speech recognition.
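The findings above are stated in terms of word error rate (WER) and its relative reduction. As a minimal sketch of how those two quantities are usually computed (this is not the authors' scoring script), the code below uses word-level edit distance. The function names `word_error_rate` and `relative_reduction` are hypothetical, and whitespace tokenization is an assumption of the sketch: Thai is written without spaces between words, so character-level scoring is a common alternative there.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers the `baseline` error rate."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline
```

For example, a drop from a WER of 0.20 to 0.14 is a relative reduction of about 0.30, which is the sense in which a "25-40%" reduction is normally read.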

Abstract

The evolution of speech technology has been spurred by the rapid increase in dataset sizes. Traditional speech models generally depend on a large amount of labeled training data, which is scarce for low-resource languages. This paper presents GigaSpeech 2, a large-scale, multi-domain, multilingual speech recognition corpus. It is designed for low-resource languages and does not rely on paired speech and text data. GigaSpeech 2 comprises about 30,000 hours of automatically transcribed speech, including Thai, Indonesian, and Vietnamese, gathered from unlabeled YouTube videos. We also introduce an automated pipeline for data crawling, transcription, and label refinement. Specifically, this pipeline involves Whisper for initial transcription, MMS for forced alignment, and multi-dimensional filtering for data quality assurance. A modified Noisy Student Training is developed to further refine…
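The abstract names multi-dimensional filtering as the quality-assurance stage but does not list the filtering dimensions here. The sketch below only illustrates the general idea of rejecting a segment that fails any one of several independent checks. Everything in it is an assumption made for illustration: the `Segment` fields, the three checks (duration, alignment score, character set), and all threshold values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str            # automatic transcript of the segment
    duration_s: float    # segment length in seconds
    align_score: float   # hypothetical alignment confidence (higher is better)


def keep_segment(
    seg: Segment,
    allowed_chars: set,
    min_dur: float = 1.0,     # hypothetical threshold
    max_dur: float = 30.0,    # hypothetical threshold
    min_score: float = -1.0,  # hypothetical threshold
) -> bool:
    """Return True only if the segment passes every filtering dimension."""
    if not seg.text.strip():
        return False  # empty transcript
    if not (min_dur <= seg.duration_s <= max_dur):
        return False  # too short or too long
    if seg.align_score < min_score:
        return False  # audio and text are poorly aligned
    # Reject transcripts containing characters outside the target script.
    return all(ch in allowed_chars or ch.isspace() for ch in seg.text)
```

A usage example under the same assumptions: with `allowed_chars = set("abcdefghijklmnopqrstuvwxyz")`, the call `keep_segment(Segment("selamat pagi", 2.5, -0.4), allowed_chars)` keeps the segment, while a transcript containing digits or a 0.3-second clip is dropped.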

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 1 · Confidence 5

Strengths

The language resource will be of value to the ASR community. The methodology for constructing the dataset is technically sound, though more advanced methods exist, for example for forced alignment (only off-the-shelf tools were used here). Results demonstrate the value of the training data.

Weaknesses

The paper describes the construction of a language resource and empirical ASR results demonstrating, essentially, that training on more relevant data helps lower WERs. Evaluation was on 3 languages only. Most importantly, the paper does not present a scientific hypothesis or any advancement in methodologies; it is purely a resource and evaluation paper and as such would be better placed in a speech conference or language resources conference.

Reviewer 02 · Rating 3 · Confidence 4

Strengths

Speech datasets for Thai, Indonesian, and Vietnamese are usually limited, so GigaSpeech 2 would benefit the part of the speech community that targets those languages. While the set of target languages is limited, the experiments comparing against commercial models, open-source models, and other datasets are convincing. They demonstrate that the proposed pipeline and the collected dataset achieve good quality (for example, compared with the YODAS dataset).

Weaknesses

Despite the usefulness of the dataset itself, the main weakness is that the general novelty of this work is limited. The pipeline does not have many components that differ significantly from existing dataset collection procedures. For example, one of the main contributions the authors claim is applying Noisy Student Training iteratively to refine the collected dataset; however, this is a fairly standard approach and was already applied in one of the original Noisy Student Training papers (

Reviewer 03 · Rating 6 · Confidence 5

Strengths

- GigaSpeech 2 test sets: a domain-rich, human-labelled benchmark enabling realistic ASR performance evaluation for Thai, Indonesian, and Vietnamese.
- A comprehensive speech dataset featuring 30K hours of raw audio across three low-resource Southeast Asian languages: Thai, Indonesian, and Vietnamese.
- High-performance ASR systems for the three languages, on par with commercially available solutions.

Weaknesses

- Although GigaSpeech 2 spans multiple domains, the performance drops on out-of-domain test sets (Common Voice, FLEURS) in Indonesian and Vietnamese suggest limitations in cross-domain generalizability (Table 3).
- Even though the authors mention zero reliance on labelled data, the approach strictly depends on having a good initial seed model that can be used for transcription and timestamp prediction; hence, this approach does not completely preclude the need for labelled data.
- The

Reviewer 04 · Rating 6 · Confidence 5

Strengths

- Strong ASR resource for 3 languages (Indonesian, Vietnamese, and Thai), with strong impact for those 3 languages.
- The methodology (multiple-iteration approach) is applicable to many other languages.

Weaknesses

- This is overall a good recipe for collecting large, good-quality speech resources, but it applies well-known techniques, and in the end the resource covers only 3 languages (extending language coverage is mentioned as future work).
- For readers not familiar with the NST method, Section 3.2 is not easy to understand and would benefit from a bit more context on the base NST method before describing how the authors modified it.
- The comparison to existing ASR systems is convincing for T

Reviewer 05 · Rating 3 · Confidence 4

Strengths

- Datasets for speech are very important, and the speech community is eager to see a new massive dataset.
- The proposed data creation flow is clear, and code is provided for reproducibility.
- The provided leaderboard gives the community a better sense of data difficulty. The reported WERs make me more convinced that the test set is realistic and applicable.

Weaknesses

## Writing
- W1: Please use the citation format correctly. It should be \citep or \citet, depending on the usage. The current version seems to be a direct transfer from NeurIPS, which uses \cite.
- W2: The organization of the paper is a little messy. There is too much bold text at multiple levels of hierarchy (for example, Sec 3.1). At the same time, there are also numbered points in Sec 4. By the time I finished reading the paper, I had lost track of what its contributions are. I suggest m

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Sparse Evolutionary Training · Dropout · Stochastic Depth · RandAugment · Noisy Student