823-OLT @ BUET DL Sprint 4.0: Context-Aware Windowing for ASR and Fine-Tuned Speaker Diarization in Bengali Long Form Audio

Ratnajit Dhar; Arpita Mallik

arXiv:2602.21183 · cs.SD · February 25, 2026

Open Access

TL;DR

This paper introduces a context-aware windowing approach for automatic speech recognition (ASR) and a fine-tuned speaker diarization system for long form Bengali audio, a setting in which Bengali remains underrepresented.

Contribution

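The paper presents frameworks for Bengali speech recognition and speaker diarization: a Whisper Medium based ASR pipeline that uses a gap aware windowing strategy, and a pretrained speaker segmentation model finetuned on the DL Sprint 4.0 competition dataset to capture Bengali conversational patterns.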

Findings

01. Effective long form Bengali speech transcription achieved.

02. Enhanced speaker diarization accuracy for Bengali conversations.

03. Scalable solutions for low-resource language speech technology.

Abstract

Bengali, despite being one of the most widely spoken languages globally, remains underrepresented in long form speech technology, particularly in systems addressing transcription and speaker attribution. We present frameworks for long form Bengali speech intelligence that address automatic speech recognition using a Whisper Medium based model and speaker diarization using a finetuned segmentation model. The ASR pipeline incorporates vocal separation, voice activity detection, and a gap aware windowing strategy to construct context preserving segments for stable decoding. For diarization, a pretrained speaker segmentation model is finetuned on the official competition dataset (provided as part of the DL Sprint 4.0 competition organized under BUET CSE Fest) to better capture Bengali conversational patterns. The resulting systems deliver both efficient transcription of long form audio and…
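The abstract does not spell out the windowing rule, so the following is only a minimal sketch of one way a gap aware windowing step could work, not the authors' implementation. It assumes voice activity detection has already produced speech segments as (start, end) pairs in seconds, and it merges neighbouring segments into one decoding window as long as the silence between them is short and the window stays within a length budget. The function name `build_windows`, the parameters `max_window_s` and `max_gap_s`, and their default values are hypothetical; the 30 second default simply reflects the fixed input length that Whisper models decode at a time.

```python
from dataclasses import dataclass
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds) from a VAD step


@dataclass
class Window:
    start: float
    end: float
    segments: List[Segment]


def build_windows(
    speech_segments: List[Segment],
    max_window_s: float = 30.0,  # assumed budget, matching Whisper's 30 s input
    max_gap_s: float = 1.0,      # assumed threshold; not taken from the paper
) -> List[Window]:
    """Greedily merge VAD segments into context preserving windows.

    A segment joins the current window only if the silence before it is at
    most max_gap_s and the merged window would not exceed max_window_s.
    A single segment longer than max_window_s is kept as its own window
    (this sketch does not split it).
    """
    windows: List[Window] = []
    current = None
    for seg_start, seg_end in sorted(speech_segments):
        if current is not None:
            gap = seg_start - current.end
            fits = (seg_end - current.start) <= max_window_s
            if gap <= max_gap_s and fits:
                current.segments.append((seg_start, seg_end))
                current.end = max(current.end, seg_end)
                continue
            # Long pause or full window: close it and start a new one.
            windows.append(current)
        current = Window(seg_start, seg_end, [(seg_start, seg_end)])
    if current is not None:
        windows.append(current)
    return windows


example_segments = [(0.0, 4.0), (4.5, 10.0), (13.0, 20.0), (20.2, 35.0), (35.5, 50.0)]
for w in build_windows(example_segments):
    print(f"{w.start:.1f}s to {w.end:.1f}s ({len(w.segments)} segments)")
```

In this sketch a window closes either at a long pause or when adding the next segment would overflow the budget, so cuts always fall in silence rather than mid-utterance. Each resulting window would then be cut from the vocal-separated audio and passed to the ASR model on its own.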

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing