An Investigation Into Various Approaches For Bengali Long-Form Speech Transcription and Bengali Speaker Diarization
Epshita Jahan, Khandoker Md Tanjinul Islam, Pritom Biswas, Tafsir Al Nafin

TL;DR
This paper explores approaches to Bengali long-form speech transcription and speaker diarization in a low-resource setting. Using fine-tuning and targeted data processing, it reports a DER of 0.27 and a WER of 0.38 on the private competition leaderboards.
Contribution
The paper introduces a multistage approach that combines a fine-tuned Whisper Medium ASR model with a pyannote diarization pipeline using a custom-trained segmentation model, tailored for Bengali recordings in diverse and noisy acoustic environments.
Findings
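Diarization error rate (DER) of 0.27 on the private leaderboard (0.19 on the public leaderboard)
Word error rate (WER) of 0.38 on the private leaderboard
Fine-tuning combined with data processing such as chunking and background noise cleaning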
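For readers unfamiliar with the two metrics above, the following is a minimal sketch of how WER and DER are conventionally defined. It is not the competition's scoring code: the source does not state the text normalization, tokenization, or collar settings used on the leaderboards, so whitespace tokenization and the function names `word_error_rate` and `diarization_error_rate` are assumptions made purely for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Words are split on whitespace (an assumption; real scoring scripts
    usually normalize punctuation and Unicode first).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def diarization_error_rate(false_alarm_s: float, missed_s: float,
                           confusion_s: float, reference_speech_s: float) -> float:
    """DER = (false alarm + missed speech + speaker confusion) / reference speech.

    All arguments are durations in seconds.
    """
    if reference_speech_s <= 0:
        raise ValueError("reference speech duration must be positive")
    return (false_alarm_s + missed_s + confusion_s) / reference_speech_s
```

Under these definitions, a WER of 0.38 corresponds to roughly 38 word-level edits per 100 reference words, and a DER of 0.27 means that the erroneous time amounts to about 27% of the reference speech duration.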
Abstract
Bengali remains a low-resource language in speech technology, especially for complex tasks like long-form transcription and speaker diarization. This paper presents a multistage approach developed for the "DL Sprint 4.0 - Bengali Long-Form Speech Recognition" and "DL Sprint 4.0 - Bengali Speaker Diarization" competitions on Kaggle, addressing the challenge of "who spoke when/what" in hour-long recordings. We implemented Whisper Medium fine-tuned on Bengali data (bengaliAI/tugstugi_bengaliai-asr_whisper-medium) for transcription and integrated pyannote/speaker-diarization-community-1 with our custom-trained segmentation model to handle diverse and noisy acoustic environments. Using a two-pass method with hyperparameter tuning, we achieved a DER of 0.27 on the private leaderboard and 0.19 on the public leaderboard. For transcription, chunking, background noise cleaning, and algorithmic…
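The chunking step mentioned in the abstract can be illustrated with a small sketch. Whisper operates on audio windows of up to 30 seconds, so an hour-long recording has to be split before transcription. The source does not describe the authors' actual chunking strategy; the fixed-length windows, the 5-second overlap, and the function name `chunk_spans` below are all assumptions chosen for illustration only.

```python
def chunk_spans(total_s: float, chunk_s: float = 30.0,
                overlap_s: float = 5.0) -> list[tuple[float, float]]:
    """Return (start, end) times in seconds that cover a recording.

    Consecutive chunks overlap by `overlap_s` so that a word cut at one
    chunk boundary appears whole in the neighbouring chunk. Both default
    values are illustrative assumptions, not the paper's settings.
    """
    if chunk_s <= 0:
        raise ValueError("chunk_s must be positive")
    if not 0 <= overlap_s < chunk_s:
        raise ValueError("overlap_s must satisfy 0 <= overlap_s < chunk_s")

    spans = []
    step = chunk_s - overlap_s  # distance between consecutive chunk starts
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        spans.append((start, end))
        if end >= total_s:
            break  # the final chunk already reaches the end of the audio
        start += step
    return spans


if __name__ == "__main__":
    # Print the chunk boundaries for a 70-second recording.
    for start, end in chunk_spans(70.0):
        print(f"{start:6.1f}s -> {end:6.1f}s")
```

Each span would then be transcribed separately and the overlapping text merged. Splitting at silences or at diarization segment boundaries is a common alternative to fixed windows, though the source does not say which approach the authors took.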
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Machine Learning and Data Classification
