Efficient ASR for Low-Resource Languages: Leveraging Cross-Lingual Unlabeled Data

Srihari Bandarupalli; Bhavana Akkiraju; Charan Devarakonda; Vamsiraghusimha Narsinga; Anil Kumar Vuppala

arXiv:2512.07277 · cs.CL · December 9, 2025

Open Access

TL;DR

This paper demonstrates that targeted cross-lingual pretraining on unlabeled speech data lets a 300M-parameter ASR model for low-resource languages reach performance comparable to systems 5 times larger, reducing the need for large models and labeled datasets.

Contribution

The paper combines a scalable pipeline for collecting unlabeled speech, targeted continual pretraining, and morphologically-aware tokenization to build effective ASR models for low-resource languages.

Findings

01 A 300M-parameter model achieves performance comparable to systems 5 times larger.

02 The model outperforms Whisper Large v3 (1.5B parameters) on Persian ASR.

03 Data relevance and pretraining strategy are crucial for low-resource ASR.

Abstract

Automatic speech recognition for low-resource languages remains fundamentally constrained by the scarcity of labeled data and computational resources required by state-of-the-art models. We present a systematic investigation into cross-lingual continuous pretraining for low-resource languages, using Perso-Arabic languages (Persian, Arabic, and Urdu) as our primary case study. Our approach demonstrates that strategic utilization of unlabeled speech data can effectively bridge the resource gap without sacrificing recognition accuracy. We construct a 3,000-hour multilingual corpus through a scalable unlabeled data collection pipeline and employ targeted continual pretraining combined with morphologically-aware tokenization to develop a 300M parameter model that achieves performance comparable to systems 5 times larger. Our model outperforms Whisper Large v3 (1.5B parameters) on Persian and…
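The size comparison in the abstract can be checked with a line of arithmetic. The sketch below uses only the two parameter counts quoted above (300M and 1.5B). The helper name `size_ratio` and the constant names are illustrative, and treating Whisper Large v3 as one of the "systems 5 times larger" is an assumption, since the abstract does not say so explicitly.

```python
def size_ratio(large_params: float, small_params: float) -> float:
    """Return how many times larger the first model is than the second."""
    if small_params <= 0:
        raise ValueError("small_params must be positive")
    return large_params / small_params


# Parameter counts as quoted in the abstract.
WHISPER_LARGE_V3_PARAMS = 1.5e9  # 1.5B
PROPOSED_MODEL_PARAMS = 300e6    # 300M

# Assumption: Whisper Large v3 is a representative "larger system".
ratio = size_ratio(WHISPER_LARGE_V3_PARAMS, PROPOSED_MODEL_PARAMS)
print(f"Whisper Large v3 is {ratio:.1f}x the size of the 300M model")
```

Under that assumption, the ratio matches the "5 times larger" figure stated in the abstract.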

Taxonomy

Topics: Speech Recognition and Synthesis · ICT in Developing Communities · Machine Learning and Data Classification