Bangla-WhisperDiar: Fine-Tuning Whisper and PyAnnote for Bangla Long-Form Speech Recognition and Speaker Diarization

Mohammed Aman Bhuiyan; Md Sazzad Hossain Adib; Samiul Basir Bhuiyan; Amit Chakraborty; Aritra Islam Saswato; Ahmed Faizul Haque Dhrubo; Mohammad Ashrafuzzaman Khan

arXiv:2605.08214 · cs.SD · May 12, 2026


TL;DR

This paper tackles Bangla long-form speech recognition and speaker diarization by fine-tuning existing Whisper and pyannote models, with extensive data augmentation on the ASR side. It reports a test-set WER of 0.2441 for ASR and a test-set DER of 0.2392 for diarization.

Contribution

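It introduces a pipeline for Bangla speech tasks that combines a Whisper model fine-tuned with augmentation (noise injection, reverb simulation, echo, clipping distortion, and pitch/time perturbation) with a fine-tuned pyannote/segmentation-3.0 diarization model.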

Findings

01  ASR WER of 0.2441 on the test set

02  Diarization DER of 0.2392 on the test set

03  Significant improvements over pretrained baselines
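
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how they are conventionally defined; it is not the paper's evaluation code. WER is the word-level edit distance divided by the number of reference words, and DER is the sum of missed speech, false alarm, and speaker confusion durations divided by the total reference speech duration. The function names and the toy inputs are hypothetical, and real evaluations typically apply text normalization and collar settings that are not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def diarization_error_rate(
    missed: float, false_alarm: float, confusion: float, total_speech: float
) -> float:
    """DER from error durations (seconds) over total reference speech time."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Toy inputs (romanized placeholders, not data from the paper).
wer = word_error_rate("a b c d", "a x c")
der = diarization_error_rate(1.0, 0.5, 0.5, 10.0)
print(f"WER={wer:.4f} DER={der:.4f}")
```

Both metrics are error rates, so lower is better, and both can exceed 1.0 when the system output contains many insertions or false alarms.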

Abstract

Automatic Speech Recognition (ASR) and speaker diarization in Bangla remain challenging due to long-form recordings, diverse acoustic conditions, and significant speaker variability. This work addresses these two core tasks in Bangla spoken language understanding by developing robust systems for long-form ASR and speaker diarization. For ASR (Problem 1), we fine-tune the tugstugi bengaliai regional asr whisper medium model on a custom-curated dataset of approximately 15,000 chunked and aligned Bangla audio segments, employing full-weight training with extensive data augmentation, including noise injection, reverb simulation, echo, clipping distortion, and pitch/time perturbation. For speaker diarization (Problem 2), we fine-tune the pyannote/segmentation-3.0 model using PyTorch Lightning on the competition-annotated diarization dataset, swapping the fine-tuned segmentation backbone into…
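
To make the augmentation list in the abstract more concrete, here is a minimal pure-Python sketch of three of the named transforms: noise injection, echo, and clipping distortion. It is an illustration under stated assumptions, not the authors' pipeline. The function names, the SNR range, the echo delay and decay, the clipping threshold, and the 50% application probabilities are all hypothetical choices; reverb simulation and pitch/time perturbation are omitted because they need more machinery than fits in a short example. A real pipeline would operate on arrays rather than Python lists.

```python
import math
import random


def add_noise(samples, snr_db, rng):
    """Inject white Gaussian noise at a target signal-to-noise ratio (dB)."""
    if not samples:
        return []
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


def add_echo(samples, delay, decay):
    """Mix in a single delayed, attenuated copy of the signal."""
    out = list(samples)
    for i in range(delay, len(samples)):
        out[i] += decay * samples[i - delay]
    return out


def clip(samples, threshold):
    """Hard-clip amplitudes to [-threshold, threshold]."""
    return [max(-threshold, min(threshold, s)) for s in samples]


def augment(samples, rng, sample_rate=16000):
    """Apply a random subset of the transforms (all settings are assumptions)."""
    out = add_noise(samples, snr_db=rng.uniform(10.0, 30.0), rng=rng)
    if rng.random() < 0.5:
        out = add_echo(out, delay=sample_rate // 10, decay=0.4)
    if rng.random() < 0.5:
        peak = max(abs(s) for s in out)
        out = clip(out, threshold=0.8 * peak)
    return out


# Usage on a synthetic one-second 440 Hz tone (a stand-in for real audio).
rng = random.Random(0)
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
augmented = augment(tone, rng)
print(len(augmented))
```

Seeding the random generator keeps each augmented copy reproducible, which helps when comparing training runs.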
