Bangla-Wave: Improving Bangla Automatic Speech Recognition Utilizing N-gram Language Models
Mohammed Rakib, Md. Ismail Hossain, Nabeel Mohammed, Fuad Rahman

TL;DR
This paper improves Bangla speech-to-text transcription by fine-tuning a pretrained wav2vec2 model on the Bengali Common Voice dataset and adding an n-gram language model as a post-processor, outperforming existing pretrained Bengali ASR models.
Contribution
It introduces a novel approach combining pretrained wav2vec2 fine-tuning with n-gram language models for improved Bangla ASR performance.
Findings
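A wav2vec2 model fine-tuned on Bengali Common Voice outperforms state-of-the-art pretrained Bengali ASR models
Adding an n-gram language model as a post-processor significantly improves ASR performance
Further experiments and hyperparameter tuning yield a robust Bangla ASR model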
Abstract
Although over 300 million people around the world speak Bangla, little work has been done on improving Bangla voice-to-text transcription because Bangla is a low-resource language. However, with the introduction of the Bengali Common Voice 9.0 speech dataset, Automatic Speech Recognition (ASR) models can now be significantly improved. With 399 hours of speech recordings, Bengali Common Voice is the largest and most diversified open-source Bengali speech corpus in the world. In this paper, we outperform the SOTA pretrained Bengali ASR models by fine-tuning a pretrained wav2vec2 model on the Common Voice dataset. We also demonstrate how to significantly improve the performance of an ASR model by adding an n-gram language model as a post-processor. Finally, we run experiments and hyperparameter tuning to produce a robust Bangla ASR model that outperforms the existing ASR models.
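To make the idea of an n-gram language model as a post-processor concrete, here is a minimal sketch of n-best rescoring with a toy bigram model. It is not the authors' pipeline: the abstract does not name a decoder or toolkit, so the function names (`train_bigram`, `lm_logprob`, `rescore`), the add-one smoothing, the weight `alpha`, the romanized toy sentences, and the acoustic scores are all illustrative assumptions.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def lm_logprob(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a sentence (toy smoothing)."""
    vocab_size = len(unigrams) + 1  # +1 reserves probability mass for unseen words
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total


def rescore(hypotheses, unigrams, bigrams, alpha=0.5):
    """Return the text maximizing acoustic_logprob + alpha * lm_logprob.

    `hypotheses` is a list of (text, acoustic_logprob) pairs; `alpha` is a
    hypothetical LM weight that would normally be tuned on held-out data.
    """
    best = max(
        hypotheses,
        key=lambda h: h[1] + alpha * lm_logprob(h[0], unigrams, bigrams),
    )
    return best[0]


if __name__ == "__main__":
    # Romanized placeholder sentences standing in for a Bangla text corpus.
    corpus = ["ami bhat khai", "ami bhat khai", "tumi bhat khao"]
    unigrams, bigrams = train_bigram(corpus)
    # Made-up acoustic scores: the misspelled hypothesis is slightly preferred
    # by the acoustic model alone.
    n_best = [("ami bat khai", -1.0), ("ami bhat khai", -1.2)]
    print(rescore(n_best, unigrams, bigrams, alpha=0.0))
    print(rescore(n_best, unigrams, bigrams, alpha=0.5))
```

With `alpha=0.0` the language model is ignored and the acoustically preferred hypothesis wins; with a positive weight, the hypothesis whose word sequence appears in the training text can overtake it. Practical systems typically apply the language model inside beam search over CTC outputs rather than over a short n-best list, and use a stronger smoothing scheme than add-one.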
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
