Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization

MD. Sagor Chowdhury; Adiba Fairooz Chowdhury

arXiv:2602.21741 · cs.CL · February 26, 2026

Open Access

TL;DR

This paper presents an end-to-end Bengali speech recognition and speaker diarization system, submitted to the DL Sprint 4.0 competition on Kaggle, that addresses language-specific challenges through domain-specific fine-tuning, source separation, and silence-aware chunking.

Contribution

The work introduces a novel Bengali-specific pipeline for ASR and diarization, combining fine-tuned models and source separation techniques tailored for low-resource Bengali speech processing.

Findings

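01 Achieved a best private WER of 0.37738 (public WER 0.36137) for ASR.

02 Achieved a best private DER of 0.27671 (public DER 0.20936) for diarization.

03 Domain-specific fine-tuning underpins both results: a BengaliAI fine-tuned Whisper medium model for ASR, and a Bengali-fine-tuned segmentation model in place of the pyannote.audio default for diarization.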
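For readers unfamiliar with the ASR metric quoted above: WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a hypothesis and a reference transcript, divided by the number of reference words. The sketch below illustrates that standard definition only. It is not the competition's scoring script, the function name `wer` is hypothetical, and it assumes whitespace tokenization with no text normalization, which may differ from how the reported scores were computed.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words.

    Assumption: words are split on whitespace and compared exactly,
    with no punctuation or Unicode normalization.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (rolling one-row dynamic programme).
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref_words)
```

Under this definition, a score such as 0.37738 corresponds to roughly 38 word-level errors per 100 reference words. Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.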

Abstract

We describe our end-to-end system for Bengali long-form speech recognition (ASR) and speaker diarization submitted to the DL Sprint 4.0 competition on Kaggle. Bengali presents substantial challenges for both tasks: a large phoneme inventory, significant dialectal variation, frequent code-mixing with English, and a relative scarcity of large-scale labelled corpora. For ASR we achieve a best private Word Error Rate (WER) of 0.37738 and public WER of 0.36137, combining a BengaliAI fine-tuned Whisper medium model with Demucs source separation for vocal isolation, silence-boundary chunking, and carefully tuned generation hyperparameters. For speaker diarization we reach a best private Diarization Error Rate (DER) of 0.27671 and public DER of 0.20936 by replacing the default segmentation model inside the pyannote.audio pipeline with a Bengali-fine-tuned variant, pairing it with…
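The abstract does not spell out how silence-boundary chunking is implemented, so the following is only one plausible reading: cap each chunk at a maximum duration and, where possible, move the cut back to the nearest low-energy frame so that words are not split mid-utterance. Every name and default here (`silence_aware_chunks`, the RMS threshold, the 20 ms frame size) is an assumption for illustration. The 30-second cap is chosen because Whisper operates on 30-second input windows, not because the paper states it.

```python
from typing import List, Sequence, Tuple


def silence_aware_chunks(
    samples: Sequence[float],
    sample_rate: int,
    max_chunk_s: float = 30.0,        # assumed cap, motivated by Whisper's 30 s window
    frame_s: float = 0.02,            # assumed analysis frame length
    silence_threshold: float = 0.01,  # assumed RMS level treated as silence
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of contiguous chunks.

    Each chunk is at most max_chunk_s long. The cut is placed in the
    latest silent frame before the hard limit; if no silent frame is
    found, the chunk is cut at the hard limit.
    """
    frame_len = max(1, int(sample_rate * frame_s))
    max_len = max(1, int(sample_rate * max_chunk_s))
    total = len(samples)

    chunks: List[Tuple[int, int]] = []
    start = 0
    while start < total:
        hard_end = min(start + max_len, total)
        if hard_end == total:
            chunks.append((start, total))  # remainder fits in one chunk
            break

        cut = hard_end  # fallback when no silence is found
        pos = hard_end - frame_len
        while pos > start:  # scan backwards, one frame at a time
            frame = samples[pos:pos + frame_len]
            rms = (sum(x * x for x in frame) / len(frame)) ** 0.5
            if rms < silence_threshold:
                cut = pos + frame_len // 2  # cut in the middle of the silent frame
                break
            pos -= frame_len

        chunks.append((start, cut))
        start = cut
    return chunks
```

Each returned index pair could then be sliced out of the waveform and passed to the recognizer separately, with the transcripts concatenated in order. A production system would more likely use a dedicated voice activity detector than a fixed RMS threshold.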


Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing