WhisperAlign: Word-Boundary-Aware ASR and WhisperX-Anchored Pyannote Diarization for Long-Form Bengali Speech

Aurchi Chowdhury; Rubaiyat -E-Zaman; Sk. Ashrafuzzaman Nafees

arXiv:2603.04809 · cs.SD · March 6, 2026

Open Access

TL;DR

This paper introduces WhisperAlign and WhisperX-anchored diarization techniques tailored for long-form Bengali speech, improving transcription accuracy and speaker boundary detection in challenging multi-speaker scenarios.

Contribution

It presents a chunking strategy built on word-level timestamps from whisper-timestamped, together with domain-specific fine-tuning of the Pyannote segmentation model on the competition dataset, targeting low-resource, long-form Bengali speech.

Findings

01. Reduced Word Error Rate (WER) in Bengali ASR

02. Improved Diarization Error Rate (DER) for overlapping speakers

03. Effective handling of long-form, multi-speaker Bengali audio
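
For readers unfamiliar with the first metric, WER is conventionally the word-level edit distance between the reference and the hypothesis (substitutions, deletions, and insertions), divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the scoring script used in the competition; the function name `word_error_rate` and the whitespace tokenization are assumptions made here for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace. Real evaluations
    usually normalize punctuation and Unicode forms first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.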

Abstract

This paper presents our solution for the DL Sprint 4.0, addressing the dual challenges of Bengali Long-Form Speech Recognition (Task 1) and Speaker Diarization (Task 2). Processing long-form, multi-speaker Bengali audio introduces significant hurdles in voice activity detection, overlapping speech, and context preservation. To solve the long-form transcription challenge, we implemented a robust audio chunking strategy utilizing whisper-timestamped, allowing us to feed precise, context-aware segments into our fine-tuned acoustic model for high-accuracy transcription. For the diarization task, we developed an integrated pipeline leveraging pyannote.audio and WhisperX. A key contribution of our approach is the domain-specific fine-tuning of the Pyannote segmentation model on the competition dataset. This adaptation allowed the model to better capture the nuances of Bengali conversational…
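
The chunking idea in the abstract can be sketched as follows: given word-level (start, end) times, such as those a tool like whisper-timestamped produces, group consecutive words into chunks that stay under a duration budget and place every cut between words rather than inside one. This is a simplified illustration under stated assumptions, not the authors' implementation. The function name `chunk_at_word_boundaries`, the tuple-based input format, and the 30-second default (chosen to match Whisper's 30-second input window) are all hypothetical choices made here.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def chunk_at_word_boundaries(
    words: List[Span],
    max_chunk_s: float = 30.0,  # hypothetical default, not from the paper
) -> List[Span]:
    """Greedily group word spans into chunks of at most max_chunk_s.

    Assumes `words` is sorted by start time. A chunk runs from the start
    of its first word to the end of its last word, so cuts never fall
    inside a word. A single word longer than max_chunk_s still becomes
    its own (oversized) chunk.
    """
    chunks: List[Span] = []
    chunk_start = None
    prev_end = None
    for start, end in words:
        if chunk_start is None:
            chunk_start = start
        elif end - chunk_start > max_chunk_s:
            # Adding this word would exceed the budget, so close the
            # current chunk at the end of the previous word.
            chunks.append((chunk_start, prev_end))
            chunk_start = start
        prev_end = end
    if chunk_start is not None:
        chunks.append((chunk_start, prev_end))
    return chunks
```

Each returned span can then be sliced out of the waveform and passed to the ASR model separately. A greedy cut is the simplest policy; preferring the longest inter-word silence near the budget would be a natural refinement.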

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research