ShobdoSetu: A Data-Centric Framework for Bengali Long-Form Speech Recognition and Speaker Diarization

Md. Nazmus Sakib; Shafiul Tanvir; Mesbah Uddin Ahamed; H.M. Aktaruzzaman Mukdho

arXiv:2603.19256·cs.CL·March 23, 2026

TL;DR

This paper introduces a data-centric approach to Bengali long-form speech recognition and speaker diarization. Through domain-adaptive fine-tuning and high-quality data construction, it reaches a WER of 15.55% and a DER of 0.267 on the private test sets.

Contribution

It presents a novel data pipeline and fine-tuning strategies for Bengali ASR and diarization, addressing low-resource challenges.

Findings

1. Achieved a word error rate (WER) of 15.55% on the private test set for speech recognition.

2. Attained a diarization error rate (DER) of 0.267 on the private test set for diarization.

3. Demonstrated the effectiveness of data engineering and domain adaptation in low-resource settings.
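
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how they are commonly defined. It is not the challenge's scoring script: the text normalization applied before WER scoring and any collar or overlap handling used for DER are not stated here, and the function names `word_error_rate` and `diarization_error_rate` are illustrative. Note that the findings report WER as a percentage and DER as a fraction.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Assumes whitespace tokenization; real scoring usually normalizes
    punctuation and spelling first (not done here).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def diarization_error_rate(missed: float, false_alarm: float,
                           confusion: float, total_speech: float) -> float:
    """DER = (missed + false alarm + speaker confusion) / total reference speech.

    All arguments are durations in the same unit (for example, seconds).
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech
```

Under these definitions, a WER of 15.55% corresponds to roughly 16 word-level errors per 100 reference words, and a DER of 0.267 means that about 26.7% of the reference speech time is attributed incorrectly.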

Abstract

Bengali is spoken by over 230 million people yet remains severely under-served in automatic speech recognition (ASR) and speaker diarization research. In this paper, we present our system for the DL Sprint 4.0 Bengali Long-Form Speech Recognition (Task 1) and Bengali Speaker Diarization Challenge (Task 2). For Task 1, we propose a data-centric pipeline that constructs a high-quality training corpus from Bengali YouTube audiobooks and dramas [tabib2026bengaliloop], incorporating LLM-assisted language normalization, fuzzy-matching-based chunk boundary validation, and muffled-zone augmentation. Fine-tuning the tugstugi/whisper-medium model on approximately 21,000 data points with beam size 5, we achieve a Word Error Rate (WER) of 16.751 on the public leaderboard and 15.551 on the private test set. For Task 2, we fine-tune the pyannote.audio community-1 segmentation model with…
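
The abstract names fuzzy-matching-based chunk boundary validation without describing it. As a rough illustration of what such a check could look like, the sketch below compares the first and last few words of a chunk's intended transcript against an ASR transcript of the same audio and rejects the chunk when either edge disagrees too much. Everything here is an assumption rather than the authors' method: the use of Python's `difflib.SequenceMatcher`, the helper names `boundary_similarity` and `is_valid_chunk`, the three-word edge window, and the 0.8 threshold.

```python
from difflib import SequenceMatcher


def boundary_similarity(chunk_text: str, asr_text: str, n_words: int = 3) -> float:
    """Return the weaker of the head and tail similarity scores (0.0 to 1.0).

    Hypothetical helper: compares only the first and last n_words of each
    transcript, since misaligned chunk boundaries show up at the edges.
    """
    def edges(text: str) -> tuple[str, str]:
        words = text.split()
        return " ".join(words[:n_words]), " ".join(words[-n_words:])

    ref_head, ref_tail = edges(chunk_text)
    hyp_head, hyp_tail = edges(asr_text)
    head_score = SequenceMatcher(None, ref_head, hyp_head).ratio()
    tail_score = SequenceMatcher(None, ref_tail, hyp_tail).ratio()
    return min(head_score, tail_score)


def is_valid_chunk(chunk_text: str, asr_text: str, threshold: float = 0.8) -> bool:
    """Accept a chunk only if both of its edges fuzzily match the ASR output.

    The 0.8 threshold is an illustrative value, not one taken from the paper.
    """
    return boundary_similarity(chunk_text, asr_text) >= threshold
```

A character-level ratio tolerates small spelling differences between the source text and the ASR output while still flagging chunks whose audio starts or ends in the wrong place; in practice the threshold would need tuning on held-out examples.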

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing