Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages?

Tawsif Tashwar Dipto; Azmol Hossain; Rubayet Sabbir Faruque; Md. Rezuwan Hassan; Kanij Fatema; Tanmoy Shome; Ruwad Naswan; Md.Foriduzzaman Zihad; Mohaymen Ul Anam; Nazia Tasnim; Hasan Mahmud; Md Kamrul Hasan; Md. Mehedi Hasan Shawon; Farig Sadeque; Tahsin Reasat

arXiv:2510.23252·cs.CL·October 30, 2025

TL;DR

This paper evaluates the ability of speech foundation models to recognize regional dialects in low-resource languages, revealing significant challenges and proposing dialect-specific training as a solution.

Contribution

It introduces a new Bengali dialect speech dataset and demonstrates the limitations of current models in dialectal ASR, highlighting the need for dialect-specific approaches.

Findings

01

Speech foundation models perform poorly on dialectal ASR tasks.

02

Dialect-specific training improves recognition accuracy.

03

The dataset serves as an out-of-distribution benchmark for low-resource ASR.

Abstract

Conventional research on speech recognition modeling relies on the canonical form for most low-resource languages, while automatic speech recognition (ASR) for regional dialects is treated as a fine-tuning task. To investigate the effects of dialectal variations on ASR, we develop a 78-hour annotated Bengali Speech-to-Text (STT) corpus named Ben-10. Investigation from linguistic and data-driven perspectives shows that speech foundation models struggle heavily in regional dialect ASR, both in zero-shot and fine-tuned settings. We observe that all deep learning methods struggle to model speech data under dialectal variations, but dialect-specific model training alleviates the issue. Our dataset also serves as an out-of-distribution (OOD) resource for ASR modeling under constrained resources. The dataset and code developed for this project are publicly available.
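The abstract does not say which metric backs the claim that models "struggle heavily". ASR quality is commonly reported as word error rate (WER), so the sketch below assumes that metric purely for illustration. The function name `word_error_rate` and the example strings are hypothetical and are not taken from the paper or from Ben-10.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: whitespace tokenization with no text normalization.
    Real evaluations usually normalize punctuation and casing first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Made-up placeholder transcripts: one substitution plus one deletion
# against a four-word reference.
print(word_error_rate("a b c d", "a x c"))
```

Under this assumption, comparing a model's WER on canonical-form speech against its WER on each dialect subset would quantify the gap that the abstract describes qualitatively.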
