Indonesian Automatic Speech Recognition with XLSR-53
Panji Arisaputra, Amalia Zahra

TL;DR
This paper demonstrates that using the XLSR-53 pre-trained model enables effective Indonesian speech recognition with limited data, achieving competitive accuracy and improved WER through language model integration.
Contribution
The study introduces an Indonesian ASR system utilizing XLSR-53, reducing data requirements and enhancing performance compared to previous methods.
Findings
Achieved 20% WER with 24 hours of data.
Reduced WER to 12% with a language model.
Competitive with similar Indonesian ASR models on the Common Voice test split.
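For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The sketch below illustrates that standard definition only; it is not the paper's evaluation code, and the function name `wer` and the example sentences are illustrative assumptions.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of five reference words.
example = wer("saya pergi ke pasar besar", "saya pergi ke pasar kecil")
```

Under this definition, a WER of 20% corresponds to roughly one word error for every five reference words.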
Abstract
This study develops an Indonesian Automatic Speech Recognition (ASR) system using the XLSR-53 pre-trained model, where XLSR stands for cross-lingual speech representations. Using this pre-trained model significantly reduces the amount of training data required to achieve a competitive Word Error Rate (WER) in non-English languages. The total amount of data used in this study is 24 hours, 18 minutes, and 1 second: (1) TITML-IDN, 14 hours and 31 minutes; (2) Magic Data, 3 hours and 33 minutes; and (3) Common Voice, 6 hours, 14 minutes, and 1 second. With a WER of 20%, the model built in this study is competitive with similar models evaluated on the test split of the Common Voice dataset. Adding a language model decreases WER by around 8 percentage points, from 20% to 12%. Thus, the results of this study improve on previous research in contributing to the…
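As a quick check of the figures quoted in the abstract, the sketch below sums the three corpus durations and separates the absolute WER reduction (in percentage points) from the relative one. The durations and WER values are taken from the abstract; the variable names are illustrative, and the relative reduction is a derived figure rather than one the paper states.

```python
from datetime import timedelta

# Corpus durations as listed in the abstract.
corpora = {
    "TITML-IDN": timedelta(hours=14, minutes=31),
    "Magic Data": timedelta(hours=3, minutes=33),
    "Common Voice": timedelta(hours=6, minutes=14, seconds=1),
}

total = sum(corpora.values(), timedelta())
total_seconds = int(total.total_seconds())
hours, remainder = divmod(total_seconds, 3600)
minutes, seconds = divmod(remainder, 60)
print(f"Total audio: {hours} h {minutes} min {seconds} s")  # Total audio: 24 h 18 min 1 s

# WER without and with the language model, as reported in the abstract.
wer_baseline = 0.20
wer_with_lm = 0.12

absolute_reduction = wer_baseline - wer_with_lm          # in WER points
relative_reduction = absolute_reduction / wer_baseline   # fraction of baseline

print(f"Absolute reduction: {absolute_reduction * 100:.0f} percentage points")
print(f"Relative reduction: {relative_reduction * 100:.0f}% of the baseline WER")
```

The distinction matters when comparing against other systems: an 8-point drop from a 20% baseline is a much larger relative improvement than the same 8-point drop from a higher baseline would be.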
Taxonomy
Methods: XLSR
