Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset

Tiezheng Yu; Rita Frieske; Peng Xu; Samuel Cahyawijaya; Cheuk Tung; Shadow Yiu; Holy Lovenia; Wenliang Dai; Elham J. Barezi; Qifeng Chen; Xiaojuan Ma; Bertram E. Shi; Pascale Fung

arXiv:2201.02419·cs.CL·January 19, 2022·1 cites

TL;DR

This paper surveys existing Cantonese ASR datasets, introduces the Multi-Domain Cantonese Corpus (MDCC), a new 73.6-hour dataset of clean read speech from Hong Kong audiobooks, and evaluates it with the Fairseq S2T Transformer alongside Common Voice zh-HK.

Contribution

The paper provides a comprehensive review of Cantonese ASR datasets, introduces the new MDCC dataset, and shows improved ASR performance using multi-dataset learning.

Findings

01 MDCC improves ASR accuracy over existing datasets

02 Multi-dataset learning enhances robustness of Cantonese ASR

03 State-of-the-art models perform better with the new dataset

Abstract

Automatic speech recognition (ASR) on low-resource languages improves the access of linguistic minorities to technological advantages provided by artificial intelligence (AI). In this paper, we address the problem of data scarcity for the Hong Kong Cantonese language by creating a new Cantonese dataset. Our dataset, Multi-Domain Cantonese Corpus (MDCC), consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong. It comprises philosophy, politics, education, culture, lifestyle and family domains, covering a wide range of topics. We also review all existing Cantonese datasets and analyze them according to their speech type, data source, total size and availability. We further conduct experiments with Fairseq S2T Transformer, a state-of-the-art ASR model, on the biggest existing dataset, Common Voice zh-HK, and our proposed MDCC,…
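
The four survey attributes named in the abstract (speech type, data source, total size, availability) can be pictured as one record per dataset. The following is a minimal sketch, not the authors' code: the class name `DatasetRecord`, its field names, and the helper `total_known_hours` are hypothetical. Only the MDCC values stated in the abstract are filled in; anything the abstract does not state is left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class DatasetRecord:
    """One row of a hypothetical survey table (illustrative field names)."""
    name: str
    speech_type: Optional[str]    # e.g. "read"; None when not stated
    data_source: Optional[str]    # where the audio came from; None when not stated
    total_hours: Optional[float]  # total size in hours; None when not stated
    available: Optional[bool]     # public availability; None when not stated


# MDCC values come from the abstract; availability is not stated there.
mdcc = DatasetRecord(
    name="MDCC",
    speech_type="read",
    data_source="Cantonese audiobooks from Hong Kong",
    total_hours=73.6,
    available=None,
)

# Common Voice zh-HK is named in the abstract, but its attributes are not given.
common_voice = DatasetRecord(
    name="Common Voice zh-HK",
    speech_type=None,
    data_source=None,
    total_hours=None,
    available=None,
)


def total_known_hours(records: Iterable[DatasetRecord]) -> float:
    """Sum the sizes of the records whose size is known, skipping unknowns."""
    return sum(r.total_hours for r in records if r.total_hours is not None)


if __name__ == "__main__":
    print(total_known_hours([mdcc, common_voice]))
```

Keeping unknown attributes as `None` makes gaps in a survey explicit, so an aggregate such as total hours is computed only over values that are actually reported.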

Code & Models

Repositories

hltchkust/cantonese-asr (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Dropout · Layer Normalization · Multi-Head Attention · Byte Pair Encoding · Dense Connections · Softmax · Label Smoothing