An Investigation Into Various Approaches For Bengali Long-Form Speech Transcription and Bengali Speaker Diarization

Epshita Jahan; Khandoker Md Tanjinul Islam; Pritom Biswas; Tafsir Al Nafin

arXiv:2603.03158 · cs.SD · March 4, 2026

Open Access

TL;DR

This paper explores fine-tuning and data-processing strategies for Bengali long-form speech transcription and speaker diarization, a low-resource setting, and reports a DER of 0.27 and a WER of 0.38 on the private competition leaderboards.

Contribution

It introduces a multistage approach that combines a Whisper Medium ASR model fine-tuned on Bengali data with the pyannote/speaker-diarization-community-1 pipeline and a custom-trained segmentation model, aimed at diverse and noisy acoustic environments.

Findings

01  DER of 0.27 on private leaderboard

02  WER of 0.38 on private leaderboard

03  Effective use of fine-tuning and data strategies
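For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below illustrates that standard definition only; it is not the competition's scoring script, and the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are split on whitespace with no text normalization;
    a real scorer may normalize punctuation or Unicode first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a WER of 0.38 corresponds to roughly 38 word-level errors (substitutions, deletions, and insertions) per 100 reference words.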

Abstract

Bengali remains a low-resource language in speech technology, especially for complex tasks like long-form transcription and speaker diarization. This paper presents a multistage approach developed for the "DL Sprint 4.0 - Bengali Long-Form Speech Recognition" and "DL Sprint 4.0 - Bengali Speaker Diarization" competitions on Kaggle, addressing the challenge of "who spoke when/what" in hour-long recordings. We implemented Whisper Medium fine-tuned on Bengali data (bengaliAI/tugstugi_bengaliai-asr_whisper-medium) for transcription and integrated pyannote/speaker-diarization-community-1 with our custom-trained segmentation model to handle diverse and noisy acoustic environments. Using a two-pass method with hyperparameter tuning, we achieved a DER of 0.27 on the private leaderboard and 0.19 on the public leaderboard. For transcription, chunking, background noise cleaning, and algorithmic…
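The abstract mentions chunking as part of the transcription pipeline but is truncated before giving details. As a rough illustration of the general idea only, the sketch below splits a long recording into fixed-length overlapping windows. The function name `chunk_spans`, the 30-second window (chosen because Whisper processes 30-second inputs), and the 5-second overlap are assumptions for illustration, not the authors' reported settings.

```python
from typing import List, Tuple


def chunk_spans(
    total_sec: float,
    chunk_sec: float = 30.0,   # assumed window length, not from the paper
    overlap_sec: float = 5.0,  # assumed overlap, not from the paper
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds covering [0, total_sec]."""
    if chunk_sec <= 0 or not (0 <= overlap_sec < chunk_sec):
        raise ValueError("need chunk_sec > 0 and 0 <= overlap_sec < chunk_sec")

    spans: List[Tuple[float, float]] = []
    step = chunk_sec - overlap_sec
    start = 0.0
    while start < total_sec:
        end = min(start + chunk_sec, total_sec)
        spans.append((start, end))
        if end >= total_sec:
            break  # the final window already reaches the end of the audio
        start += step
    return spans


if __name__ == "__main__":
    # A one-hour recording split into overlapping windows.
    windows = chunk_spans(3600.0)
    print(len(windows), windows[0], windows[-1])
```

Overlapping windows reduce the chance of cutting a word at a boundary, at the cost of having to reconcile duplicated text where adjacent windows overlap.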

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Machine Learning and Data Classification