Bengali-Loop: Community Benchmarks for Long-Form Bangla ASR and Speaker Diarization

H.M. Shadman Tabib; Istiak Ahmmed Rifti; Abdullah Muhammed Amimul Ehsan; Somik Dasgupta; Md Zim Mim Siddiqee Sowdha; Abrar Jahin Sarker; Md. Rafiul Islam Nijamy; Tanvir Hossain; Mst. Metaly Khatun; Munzer Mahmood; Rakesh Debnath; Gourab Biswas; Asif Karim; Wahid Al Azad Navid; Masnoon Muztahid; Fuad Ahmed Udoy; Shahad Shahriar Rahman; Md. Tashdiqur Rahman Shifat; Most. Sonia Khatun; Mushfiqur Rahman; Md. Miraj Hasan; Anik Saha; Mohammad Ninad Mahmud Nobo; Soumik Bhattacharjee; Tusher Bhomik; Ahmmad Nur Swapnil; Shahriar Kabir

arXiv:2602.14291 · cs.SD · February 17, 2026

Open Access

TL;DR

Bengali-Loop introduces community benchmarks for long-form Bangla speech recognition and speaker diarization, providing datasets, evaluation protocols, and baselines to advance research in under-resourced Bengali language technology.

Contribution

This work presents two reproducible community benchmarks for long-form Bangla ASR and speaker diarization, each with a dataset, annotation rules, data formats, and baseline results.

Findings

01 ASR baseline (Tugstugi): 34.07% WER

02 Speaker diarization baseline (pyannote.audio): 40.08% DER

03 Standardized evaluation protocols (WER/CER for ASR, DER for diarization)

Abstract

Bengali (Bangla) remains under-resourced in long-form speech technology despite its wide use. We present Bengali-Loop, two community benchmarks to address this gap: (1) a long-form ASR corpus of 191 recordings (158.6 hours, 792k words) from 11 YouTube channels, collected via a reproducible subtitle-extraction pipeline and human-in-the-loop transcript verification; and (2) a speaker diarization corpus of 24 recordings (22 hours, 5,744 annotated segments) with fully manual speaker-turn labels in CSV format. Both benchmarks target realistic multi-speaker, long-duration content (e.g., Bangla drama/natok). We establish baselines (Tugstugi: 34.07% WER; pyannote.audio: 40.08% DER) and provide standardized evaluation protocols (WER/CER, DER), annotation rules, and data formats to support reproducible benchmarking and future model development for Bangla long-form ASR and diarization.
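
For readers unfamiliar with the metrics named in the abstract, the following is a minimal sketch of how WER, CER, and DER are conventionally defined. It is not the paper's evaluation code. The function names are hypothetical, and the normalization choices (whitespace tokenization for WER, dropping spaces and counting Unicode code points for CER) are assumptions made for illustration; the benchmark's own protocol may tokenize or normalize Bangla text differently. The DER function only combines error durations that a diarization scorer has already measured.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit-cost edits)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count.

    Assumption: words are separated by whitespace.
    """
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate over Unicode code points.

    Assumption: whitespace is removed before scoring, and each code point
    counts as one character (not a grapheme cluster).
    """
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


def der(missed, false_alarm, confusion, total_reference_speech):
    """Diarization error rate from precomputed durations (in seconds).

    DER = (missed speech + false alarm + speaker confusion)
          / total reference speech time.
    """
    if total_reference_speech <= 0:
        raise ValueError("total reference speech must be positive")
    return (missed + false_alarm + confusion) / total_reference_speech
```

As a quick check of the definitions, `wer("a b c d", "a x c")` counts one substitution and one deletion against four reference words, giving 0.5. Because the denominator is the reference length, both WER and DER can exceed 100% when a system inserts a lot of spurious output.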

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research