Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset
Tiezheng Yu, Rita Frieske, Peng Xu, Samuel Cahyawijaya, Cheuk Tung Shadow Yiu, Holy Lovenia, Wenliang Dai, Elham J. Barezi, Qifeng Chen, Xiaojuan Ma, Bertram E. Shi, Pascale Fung

TL;DR
This paper surveys existing Cantonese ASR datasets, introduces MDCC, a new 73.6-hour corpus of clean read speech from Hong Kong audiobooks, and demonstrates its effectiveness in experiments with the Fairseq S2T Transformer on MDCC and Common Voice zh-HK.
Contribution
It provides a comprehensive review of Cantonese ASR datasets, introduces the new MDCC dataset, and shows improved ASR performance using multi-dataset learning.
Findings
MDCC improves ASR accuracy over existing datasets
Multi-dataset learning enhances robustness of Cantonese ASR
State-of-the-art models perform better with the new dataset
Abstract
Automatic speech recognition (ASR) on low resource languages improves the access of linguistic minorities to technological advantages provided by artificial intelligence (AI). In this paper, we address the problem of data scarcity for the Hong Kong Cantonese language by creating a new Cantonese dataset. Our dataset, Multi-Domain Cantonese Corpus (MDCC), consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong. It comprises philosophy, politics, education, culture, lifestyle and family domains, covering a wide range of topics. We also review all existing Cantonese datasets and analyze them according to their speech type, data source, total size and availability. We further conduct experiments with Fairseq S2T Transformer, a state-of-the-art ASR model, on the biggest existing dataset, Common Voice zh-HK, and our proposed MDCC,…
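The abstract names two corpora, Common Voice zh-HK and MDCC, and the findings mention multi-dataset learning. As a rough illustration of what combining two corpora can look like, the sketch below tags each utterance with its source and draws training samples by first picking a corpus according to a weight, then picking a clip uniformly. This is not the authors' method: this page does not describe how the paper combines datasets, and `Utterance`, `mix_datasets`, the weights, and the toy file names are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    audio_path: str  # hypothetical path to an audio clip
    transcript: str  # text paired with the clip
    source: str      # name of the corpus the clip came from


def mix_datasets(datasets, weights, num_samples, seed=0):
    """Draw num_samples utterances from several corpora.

    datasets: dict mapping corpus name -> list of Utterance
    weights:  dict mapping corpus name -> non-negative sampling weight
    A corpus is picked in proportion to its weight, then one clip is
    picked uniformly at random from that corpus (with replacement).
    """
    names = [n for n in datasets if datasets[n] and weights.get(n, 0) > 0]
    if not names:
        raise ValueError("no non-empty dataset with a positive weight")
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    probs = [weights[n] for n in names]
    picked = rng.choices(names, weights=probs, k=num_samples)
    return [rng.choice(datasets[name]) for name in picked]


# Toy stand-ins for the two corpora (assumed names, not real files).
common_voice = [
    Utterance(f"cv_{i}.wav", f"cv text {i}", "common_voice_zh_hk") for i in range(5)
]
mdcc = [Utterance(f"mdcc_{i}.wav", f"mdcc text {i}", "mdcc") for i in range(5)]

batch = mix_datasets(
    {"common_voice_zh_hk": common_voice, "mdcc": mdcc},
    {"common_voice_zh_hk": 1.0, "mdcc": 1.0},
    num_samples=8,
)
```

Sampling by corpus weight rather than concatenating the lists keeps a smaller corpus from being drowned out by a larger one; equal weights here are an arbitrary choice for the example.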
Code & Models
- 🤗 alvanlii/wav2vec2-BERT-cantonese (model) · 80 dl · ♡ 6
- 🤗 alvanlii/whisper-small-cantonese (model) · 1.2k dl · ♡ 111
- 🤗 alvanlii/distil-whisper-small-cantonese (model) · 51 dl · ♡ 9
- 🤗 liushiufaiedward/whisper-small-cantonese (model) · 4 dl
- 🤗 hyperkit/distil-whisper-small-cantonese-coreml (model) · 6 dl
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Dropout · Layer Normalization · Multi-Head Attention · Byte Pair Encoding · Dense Connections · Softmax · Label Smoothing
