823-OLT @ BUET DL Sprint 4.0: Context-Aware Windowing for ASR and Fine-Tuned Speaker Diarization in Bengali Long Form Audio
Ratnajit Dhar, Arpita Mallik

TL;DR
This paper introduces a context-aware windowing approach for ASR and a fine-tuned speaker diarization system for long-form Bengali audio, a setting in which the language remains underrepresented.
Contribution
It presents frameworks for Bengali speech recognition and speaker diarization, including a gap-aware windowing strategy and a segmentation model fine-tuned for Bengali conversational speech.
Findings
Effective transcription of long-form Bengali speech.
Improved speaker diarization accuracy for Bengali conversations.
Scalable approaches to speech technology for low-resource languages.
Abstract
Bengali, despite being one of the most widely spoken languages globally, remains underrepresented in long-form speech technology, particularly in systems addressing transcription and speaker attribution. We present frameworks for long-form Bengali speech intelligence that address automatic speech recognition using a Whisper Medium-based model and speaker diarization using a fine-tuned segmentation model. The ASR pipeline incorporates vocal separation, voice activity detection, and a gap-aware windowing strategy to construct context-preserving segments for stable decoding. For diarization, a pretrained speaker segmentation model is fine-tuned on the official competition dataset (provided as part of the DL Sprint 4.0 competition organized under BUET CSE Fest) to better capture Bengali conversational patterns. The resulting systems deliver both efficient transcription of long-form audio and…
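The abstract does not spell out the windowing rule, so the following is only a minimal sketch of what a gap-aware windowing step could look like, not the authors' implementation. It assumes voice activity detection has already produced speech segments as (start, end) times in seconds, and it greedily merges neighbouring segments into one window while the silence between them stays short and the window stays within a length budget. The function name `build_windows` and the parameters `max_window` and `max_gap` are hypothetical; the 30-second default reflects Whisper's fixed input length, and the 1-second gap threshold is an arbitrary illustrative value.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def build_windows(
    speech: List[Segment],
    max_window: float = 30.0,  # assumed budget: Whisper consumes 30 s inputs
    max_gap: float = 1.0,      # assumed silence threshold, illustrative only
) -> List[Segment]:
    """Greedily merge VAD speech segments into decoding windows.

    A segment joins the current window only if the silence before it is
    at most max_gap and the merged window is at most max_window long.
    Otherwise it starts a new window.
    """
    windows: List[Segment] = []
    for start, end in sorted(speech):
        if windows:
            w_start, w_end = windows[-1]
            gap = start - w_end
            if gap <= max_gap and end - w_start <= max_window:
                # Extend the current window to cover this segment.
                windows[-1] = (w_start, max(w_end, end))
                continue
        # Long pause or budget exceeded: open a new window.
        windows.append((start, end))
    return windows


# Hypothetical VAD output for a short recording.
vad_segments = [(0.0, 4.0), (4.5, 10.0), (13.0, 20.0), (20.2, 33.0)]
print(build_windows(vad_segments))
# [(0.0, 10.0), (13.0, 33.0)]
```

In this sketch the 3-second pause between 10.0 s and 13.0 s forces a window boundary, while the shorter pauses are absorbed so that each window keeps its surrounding context. A single speech segment longer than `max_window` is passed through unchanged; a real pipeline would need an additional rule for splitting it.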
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
