Investigating self-supervised, weakly supervised and fully supervised training approaches for multi-domain automatic speech recognition: a study on Bangladeshi Bangla

Ahnaf Mozib Samin, M. Humayon Kobir, Md. Mushtaq Shahriyar Rafee, M. Firoz Ahmed, Mehedi Hasan, Partha Ghosh, Shafkat Kibria, and M. Shahidur Rahman

arXiv:2210.12921 · cs.CL · May 12, 2023

Open Access

TL;DR

This study compares self-supervised, weakly supervised, and fully supervised training methods for multi-domain Bangla speech recognition, highlighting the effectiveness of self-supervised pre-training and the importance of domain selection.

Contribution

The paper introduces BanSpeech, a novel multi-domain Bangla ASR benchmark, and evaluates self-supervised, weakly supervised, and fully supervised training approaches on it, emphasizing the robustness gained from self-supervised pre-training.

Findings

01. Self-supervised wav2vec 2.0 outperforms other methods in multi-domain ASR.

02. Models trained on SUBAK.KO struggle with spontaneous speech domains.

03. The BanSpeech benchmark will be made publicly available for further research.

Abstract

Despite huge improvements in automatic speech recognition (ASR) employing neural networks, ASR systems still suffer from poor robustness and generalizability under domain shift. This is mainly because principal corpus design criteria are often not adequately identified and examined when ASR datasets are compiled. In this study, we investigate the robustness of state-of-the-art transfer learning approaches, such as self-supervised wav2vec 2.0 and weakly supervised Whisper, as well as fully supervised convolutional neural networks (CNNs), for multi-domain ASR. We also demonstrate the significance of domain selection in corpus building by assessing these models on BanSpeech, a novel multi-domain Bangladeshi Bangla ASR evaluation benchmark that contains approximately 6.52 hours of human-annotated speech and 8085 utterances from 13 distinct domains. SUBAK.KO, a mostly…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Layer Normalization