Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment

Sanjid Hasan; Risalat Labib; A H M Fuad; Bayazid Hasan

arXiv:2602.23070·cs.SD·February 27, 2026

Open Access · 2 Models · 1 Dataset

TL;DR

This paper presents a comprehensive Bengali speech dataset and optimized methods for long-form speech recognition and speaker diarization, emphasizing data augmentation, fine-tuning, and heuristic post-processing to improve performance in low-resource settings.

Contribution

It introduces the Lipi-Ghor-882 dataset and demonstrates that targeted fine-tuning and heuristic post-processing significantly enhance long-form Bengali ASR and diarization.

Findings

01

Targeted fine-tuning with aligned annotations improves ASR accuracy.

02

Heuristic post-processing boosts speaker diarization performance.

03

Achieved a real-time factor (RTF) of 0.019, meaning processing takes roughly 1.9% of the audio's duration.
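
For readers unfamiliar with the metric, RTF is processing time divided by audio duration, so values below 1 mean faster than real time. The sketch below only illustrates that arithmetic. The function name and the timing figures (68.4 seconds of compute for one hour of audio) are hypothetical, chosen to reproduce the reported 0.019, and are not measurements from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (RTF)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical timing: one hour of audio processed in 68.4 seconds.
rtf = real_time_factor(processing_seconds=68.4, audio_seconds=3600.0)
print(f"RTF = {rtf:.3f}")
```

At this rate, a ten-hour recording would need about 11.4 minutes of compute.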

Abstract

Although Automatic Speech Recognition (ASR) in Bengali has seen significant progress, processing long-duration audio and performing robust speaker diarization remain critical research gaps. To address the severe scarcity of joint ASR and diarization resources for this language, we introduce Lipi-Ghor-882, a comprehensive 882-hour multi-speaker Bengali dataset. In this paper, detailing our submission to the DL Sprint 4.0 competition, we systematically evaluate various architectures and approaches for long-form Bengali speech. For ASR, we demonstrate that raw data scaling is ineffective; instead, targeted fine-tuning utilizing perfectly aligned annotations paired with synthetic acoustic degradation (noise and reverberation) emerges as the single most effective approach. Conversely, for speaker diarization, we observed that global open-source state-of-the-art models (such as Diarizen)…
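
The abstract names noise and reverberation as the synthetic degradations but gives no recipe here. A minimal sketch of one common way to apply them is shown below, assuming additive white Gaussian noise at a target SNR and a synthetic, exponentially decaying impulse response. All function names, the 10 dB SNR, and the 0.4 s RT60 are illustrative assumptions, not the authors' settings or code.

```python
import numpy as np


def add_noise_at_snr(signal, snr_db, rng):
    """Add white Gaussian noise scaled to the requested SNR in dB."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def synthetic_impulse_response(sample_rate, rt60, rng):
    """Build a noise burst whose amplitude decays by 60 dB over rt60 seconds."""
    n = int(sample_rate * rt60)
    t = np.arange(n) / sample_rate
    decay = np.exp(-np.log(1000.0) * t / rt60)  # factor 1000 in amplitude = 60 dB
    ir = rng.normal(size=n) * decay
    ir[0] = 1.0  # direct path
    return ir / np.sqrt(np.sum(ir ** 2))  # unit energy


def add_reverb(signal, impulse_response):
    """Convolve with the impulse response and trim to the input length."""
    return np.convolve(signal, impulse_response)[: len(signal)]


def degrade(signal, sample_rate, snr_db=10.0, rt60=0.4, seed=0):
    """Apply reverberation, then additive noise (assumed order and defaults)."""
    rng = np.random.default_rng(seed)
    ir = synthetic_impulse_response(sample_rate, rt60, rng)
    return add_noise_at_snr(add_reverb(signal, ir), snr_db, rng)


if __name__ == "__main__":
    sr = 16000
    clean = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)  # 1 s test tone
    noisy = degrade(clean, sr)
    print(noisy.shape)
```

In practice, recorded noise clips and measured room impulse responses are often substituted for the synthetic ones used here. The transcript stays unchanged, so the aligned annotations remain valid for the degraded audio.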

Code & Models

Models

Datasets

Sanjidh090/Lipi-Ghor-bn-882-SSTT
dataset · 4.7k downloads

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing