Applying wav2vec2 for Speech Recognition on Bengali Common Voices Dataset
H.A.Z. Sameen Shahgir, Khondker Salman Sayeed, Tanjeem Azwad Zaman

TL;DR
This paper fine-tunes wav2vec 2.0 for Bengali speech recognition on the Bengali Common Voice dataset, reaching a WER of 0.2524 on the validation set and reporting the best Levenshtein Distance (6.234) among competing models on a hidden test set.
Contribution
It presents the first application of wav2vec 2.0 to Bengali speech recognition with detailed training and evaluation results.
Findings
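Achieved a WER of 25.24% on a validation set of size 7,747 after 71 epochs of training.
Reduced the Levenshtein Distance on the test set from 2.6446 to 2.60753 after 7 more epochs on the combined, reshuffled training and validation data.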
Outperformed competing models with a Levenshtein Distance of 6.234 on hidden data.
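The two metrics quoted above can be illustrated with a minimal sketch. It assumes that the Levenshtein Distance is computed per utterance on characters and then averaged, and that WER is the word-level edit distance divided by the number of reference words; the summary does not say exactly how the paper computes either, and the function names here are illustrative only.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of words)."""
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate for one utterance (assumed definition)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)


def mean_levenshtein(references, hypotheses):
    """Average character-level edit distance over a set of utterances
    (assumed aggregation; the paper may aggregate differently)."""
    pairs = list(zip(references, hypotheses))
    return sum(levenshtein(r, h) for r, h in pairs) / len(pairs)
```

Under these assumptions, lower is better for both metrics, and a WER of 0.2524 means roughly one word-level edit for every four reference words.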
Abstract
Speech is inherently continuous, where discrete words, phonemes and other units are not clearly segmented, and so speech recognition has been an active research problem for decades. In this work we have fine-tuned wav2vec 2.0 to recognize and transcribe Bengali speech, training it on the Bengali Common Voice Speech Dataset. After training for 71 epochs on a training set of 36,919 mp3 files, we achieved a training loss of 0.3172 and a WER of 0.2524 on a validation set of size 7,747. Using a 5-gram language model, the Levenshtein Distance was 2.6446 on a test set of size 7,747. Then the training set and validation set were combined, shuffled and split in an 85-15 ratio. Training for 7 more epochs on this combined dataset yielded an improved Levenshtein Distance of 2.60753 on the test set. Our model was the best performing one, achieving a Levenshtein Distance of 6.234 on a…
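The combine, shuffle and 85-15 split step described in the abstract could look roughly like the sketch below. It is an illustration only: the function name, the fixed seed and the use of Python's `random` module are assumptions, not details from the paper.

```python
import random


def combine_and_split(train_items, valid_items, train_fraction=0.85, seed=0):
    """Merge two datasets, shuffle them, and split by train_fraction.

    The seed and the rounding-down of the cut point are assumptions
    made for this sketch.
    """
    combined = list(train_items) + list(valid_items)
    rng = random.Random(seed)  # local RNG so the split is reproducible
    rng.shuffle(combined)
    cut = int(len(combined) * train_fraction)
    return combined[:cut], combined[cut:]
```

With the set sizes quoted in the abstract (36,919 and 7,747 items), this sketch yields 37,966 training items and 6,700 validation items; the paper's actual split sizes may differ.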
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques
Methods: Test
