A Holistic Framework for Robust Bangla ASR and Speaker Diarization with Optimized VAD and CTC Alignment

Zarif Ishmam; Zarif Mahir; Shafnan Wasif; Md. Ishtiak Moin

arXiv:2602.22935·cs.SD·February 27, 2026

TL;DR

This paper introduces a framework for robust Bangla automatic speech recognition and speaker diarization on long audio, built on Voice Activity Detection (VAD) optimization and Connectionist Temporal Classification (CTC) segmentation via forced word alignment, to address the challenges of processing a low-resource language.

Contribution

It presents novel optimization pipelines and finetuning strategies specifically designed for longform Bangla speech, improving accuracy and scalability.

Findings

01 Enhanced transcription accuracy over long audio segments

02 Improved speaker diarization in multi-speaker environments

03 Effective noise removal and data augmentation techniques

Abstract

Despite being one of the most widely spoken languages globally, Bangla remains a low-resource language in the field of Natural Language Processing (NLP). Mainstream Automatic Speech Recognition (ASR) and Speaker Diarization systems for Bangla struggle when processing longform audio exceeding 30-60 seconds. This paper presents a robust framework specifically engineered for extended Bangla content by leveraging preexisting models enhanced with novel optimization pipelines for the DL Sprint 4.0 contest. Our approach utilizes Voice Activity Detection (VAD) optimization and Connectionist Temporal Classification (CTC) segmentation via forced word alignment to maintain temporal accuracy and transcription integrity over long durations. Additionally, we employed several finetuning techniques and preprocessed the data using augmentation techniques and noise removal. By bridging the performance…
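The abstract does not say how VAD output is turned into inputs the recognizer can handle. One common pattern for longform audio, sketched here as an assumption and not as the authors' pipeline, is to greedily pack adjacent speech segments into chunks no longer than a fixed maximum. The function name `pack_segments`, the 30-second limit, and the 0.5-second gap threshold are all illustrative choices, not values from the paper.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec) of detected speech


def pack_segments(
    speech: List[Segment],
    max_len: float = 30.0,   # hypothetical chunk limit in seconds
    max_gap: float = 0.5,    # hypothetical max silence bridged inside a chunk
) -> List[Segment]:
    """Greedily merge VAD speech segments into chunks of at most max_len seconds."""
    chunks: List[Segment] = []
    for start, end in sorted(speech):
        # A single segment longer than max_len is cut into fixed-size pieces.
        while end - start > max_len:
            chunks.append((start, start + max_len))
            start += max_len
        if chunks:
            c_start, c_end = chunks[-1]
            # Extend the previous chunk only if the pause is short and the
            # merged chunk still fits within max_len.
            if start - c_end <= max_gap and end - c_start <= max_len:
                chunks[-1] = (c_start, end)
                continue
        chunks.append((start, end))
    return chunks


if __name__ == "__main__":
    vad_output = [(0.0, 10.0), (10.2, 25.0), (25.3, 40.0), (50.0, 55.0)]
    for chunk in pack_segments(vad_output):
        print(chunk)
```

Each resulting chunk can be transcribed independently and its timestamps offset by the chunk start. In a setup like this, a forced-alignment step would then refine word boundaries within each chunk.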

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing