Enhancing Indonesian Automatic Speech Recognition: Evaluating Multilingual Models with Diverse Speech Variabilities

Aulia Adila; Dessi Lestari; Ayu Purwarianti; Dipta Tanaya; Kurniawati; Azizah; Sakriani Sakti

arXiv:2410.08828 · cs.CL · October 15, 2024

Open Access

TL;DR

This study evaluates multilingual speech recognition models, especially Whisper, on Indonesian speech data that varies in speaking style, speech context, and background noise, and highlights how strongly speaking style affects model performance.

Contribution

It compiles a new Indonesian speech dataset covering diverse speech variabilities and assesses the performance of the MMS and Whisper models on this data.

Findings

01. The fine-tuned Whisper model achieved the best accuracy.

02. Variability in speaking style significantly affects model performance.

03. Multilingual models can effectively transcribe diverse Indonesian speech data.

Abstract

An ideal speech recognition model can transcribe speech accurately across varying characteristics of the speech signal, such as speaking style (read and spontaneous), speech context (formal and informal), and background noise conditions (clean and moderate). Building such a model requires a significant amount of training data with diverse speech characteristics. Currently, Indonesian data is dominated by read, formal, and clean speech, leading to a scarcity of Indonesian data with other speech variabilities. To develop Indonesian automatic speech recognition (ASR), we present our research on state-of-the-art speech recognition models, namely Massively Multilingual Speech (MMS) and Whisper, and compile a dataset of Indonesian speech with these variabilities to facilitate our study. We further investigate the models' predictive ability to transcribe Indonesian…

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems