LAHAJA: A Robust Multi-accent Benchmark for Evaluating Hindi ASR Systems
Tahir Javed, Janki Nawale, Sakshi Joshi, Eldho George, Kaushal Bhogale, Deovrat Mehendale, Mitesh M. Khapra

TL;DR
This paper introduces LAHAJA, a comprehensive Hindi ASR benchmark with diverse accents, and demonstrates that multilingual training improves model robustness, highlighting challenges in recognizing regional accents and specialized vocabulary.
Contribution
The creation of LAHAJA, a large multi-accent Hindi speech benchmark, and the evaluation of models showing the benefits of multilingual training for accent robustness.
Findings
Existing models perform poorly on LAHAJA.
Multilingual training improves ASR performance.
Performance drops for North-East and South Indian speakers.
Abstract
Hindi, one of the most spoken languages of India, exhibits a diverse array of accents due to its usage among individuals from diverse linguistic origins. To enable a robust evaluation of Hindi ASR systems on multiple accents, we create a benchmark, LAHAJA, which contains read and extempore speech on a diverse set of topics and use cases, with a total of 12.5 hours of Hindi audio, sourced from 132 speakers spanning 83 districts of India. We evaluate existing open-source and commercial models on LAHAJA and find their performance to be poor. We then train models using different datasets and find that our model trained on multilingual data with good speaker diversity outperforms existing models by a significant margin. We also present a fine-grained analysis which shows that the performance declines for speakers from North-East and South India, especially with content heavy in named entities…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing
Methods: Sparse Evolutionary Training
