Efficient ASR for Low-Resource Languages: Leveraging Cross-Lingual Unlabeled Data
Srihari Bandarupalli, Bhavana Akkiraju, Charan Devarakonda, Vamsiraghusimha Narsinga, Anil Kumar Vuppala

TL;DR
This paper demonstrates that strategic cross-lingual pretraining on unlabeled speech data can significantly improve ASR performance for low-resource languages, reducing the need for large models and labeled datasets.
Contribution
The paper introduces a scalable pretraining approach that combines unlabeled speech data with morphologically-aware tokenization to build effective ASR models for low-resource languages.
Findings
A 300M-parameter model achieved performance comparable to systems 5 times larger.
The model outperformed Whisper Large v3 (1.5B parameters) on Persian ASR.
Data relevance and pretraining strategy proved crucial for low-resource ASR.
Abstract
Automatic speech recognition for low-resource languages remains fundamentally constrained by the scarcity of labeled data and computational resources required by state-of-the-art models. We present a systematic investigation into cross-lingual continual pretraining for low-resource languages, using Perso-Arabic languages (Persian, Arabic, and Urdu) as our primary case study. Our approach demonstrates that strategic utilization of unlabeled speech data can effectively bridge the resource gap without sacrificing recognition accuracy. We construct a 3,000-hour multilingual corpus through a scalable unlabeled data collection pipeline and employ targeted continual pretraining combined with morphologically-aware tokenization to develop a 300M parameter model that achieves performance comparable to systems 5 times larger. Our model outperforms Whisper Large v3 (1.5B parameters) on Persian and…
Taxonomy
Topics: Speech Recognition and Synthesis · ICT in Developing Communities · Machine Learning and Data Classification
