Snow Mountain: Dataset of Audio Recordings of The Bible in Low Resource Languages
Kavitha Raju, Anjaly V, Ryan Lish, Joel Mathew

TL;DR
This paper introduces a new open-licensed dataset of Bible audio recordings in low-resource northern Indian languages and evaluates baseline ASR models to facilitate future research in these underrepresented languages.
Contribution
It provides the first publicly available dataset of Bible audio recordings in low-resource northern Indian languages and establishes baseline ASR models for these languages.
Findings
Created and released a new low-resource language dataset
Trained and analyzed two baseline ASR models on the dataset
Provided experimental splits for future research
Abstract
Automatic Speech Recognition (ASR) has increasing utility in the modern world. There are many ASR models available for languages with large amounts of training data, such as English. However, low-resource languages are poorly represented. In response, we create and release an open-licensed and formatted dataset of audio recordings of the Bible in low-resource northern Indian languages. We set up multiple experimental splits and train and analyze two competitive ASR models to serve as baselines for future research using this data.
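The abstract does not say which metric is used to analyze the baseline models. ASR baselines are commonly compared by word error rate (WER), so the following is a minimal sketch of that metric under the assumption that WER is the quantity of interest. The function name `word_error_rate` and the whitespace tokenization are illustrative choices, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: words are separated by whitespace. Real evaluations
    usually normalize text (punctuation, casing) first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Because the distance is divided by the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words.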
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Diverse Musicological Studies
