A Holistic Framework for Robust Bangla ASR and Speaker Diarization with Optimized VAD and CTC Alignment
Zarif Ishmam, Zarif Mahir, Shafnan Wasif, Md. Ishtiak Moin

TL;DR
This paper introduces a comprehensive framework for robust Bangla automatic speech recognition and speaker diarization, optimized for long audio content through advanced VAD and CTC techniques, addressing challenges in low-resource language processing.
Contribution
It presents novel optimization pipelines and fine-tuning strategies specifically designed for long-form Bangla speech, improving accuracy and scalability.
Findings
Enhanced transcription accuracy over long audio segments
Improved speaker diarization in multi-speaker environments
Effective noise removal and data augmentation techniques
Abstract
Despite being one of the most widely spoken languages globally, Bangla remains a low-resource language in the field of Natural Language Processing (NLP). Mainstream Automatic Speech Recognition (ASR) and Speaker Diarization systems for Bangla struggle when processing long-form audio exceeding 30-60 seconds. This paper presents a robust framework specifically engineered for extended Bangla content by leveraging pre-existing models enhanced with novel optimization pipelines for the DL Sprint 4.0 contest. Our approach utilizes Voice Activity Detection (VAD) optimization and Connectionist Temporal Classification (CTC) segmentation via forced word alignment to maintain temporal accuracy and transcription integrity over long durations. Additionally, we employed several fine-tuning techniques and preprocessed the data using augmentation techniques and noise removal. By bridging the performance…
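To make the long-audio problem concrete, the following is a minimal sketch of one common way VAD output can be used to keep ASR inputs short: neighbouring speech segments are greedily packed into chunks below a length limit. It is not the authors' pipeline. The function name `pack_segments`, the 30-second limit, and the 0.5-second gap threshold are all illustrative assumptions, and the segment timestamps are assumed to come from some external VAD model.

```python
from typing import List, Tuple

# A speech segment as (start_sec, end_sec), e.g. as produced by a VAD model.
Segment = Tuple[float, float]


def pack_segments(
    segments: List[Segment],
    max_len: float = 30.0,   # assumed chunk-length limit in seconds
    max_gap: float = 0.5,    # assumed largest silence bridged inside a chunk
) -> List[Segment]:
    """Greedily merge neighbouring speech segments into chunks.

    A segment joins the current chunk only if the silence before it is at
    most max_gap and the merged chunk stays within max_len. A single segment
    that is already longer than max_len is kept as its own chunk unchanged.
    """
    chunks: List[Segment] = []
    for start, end in sorted(segments):
        if chunks:
            c_start, c_end = chunks[-1]
            if start - c_end <= max_gap and end - c_start <= max_len:
                # Extend the current chunk to cover this segment.
                chunks[-1] = (c_start, max(c_end, end))
                continue
        # Otherwise start a new chunk at this segment.
        chunks.append((start, end))
    return chunks


if __name__ == "__main__":
    vad_output = [(0.0, 10.0), (10.2, 25.0), (25.3, 40.0), (50.0, 55.0)]
    for chunk in pack_segments(vad_output):
        print(chunk)
```

Each resulting chunk could then be transcribed independently, with its start time used as an offset so that word timestamps from a later alignment step map back onto the original recording.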
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
