Snow Mountain: Dataset of Audio Recordings of The Bible in Low Resource Languages

Kavitha Raju; Anjaly V; Ryan Lish; Joel Mathew

arXiv:2206.01205·eess.AS·May 24, 2023

TL;DR

This paper introduces a new open-licensed dataset of Bible audio recordings in low-resource northern Indian languages and evaluates baseline ASR models to facilitate future research in these underrepresented languages.

Contribution

It provides the first publicly available dataset of Bible audio recordings in low-resource northern Indian languages and establishes baseline ASR models for these languages.

Findings

1. Created and released a new low-resource language dataset
2. Trained and analyzed two baseline ASR models on the dataset
3. Provided experimental splits for future research

Abstract

Automatic Speech Recognition (ASR) has increasing utility in the modern world. Many ASR models are available for languages with large amounts of training data, such as English. However, low-resource languages are poorly represented. In response, we create and release an open-licensed and formatted dataset of audio recordings of the Bible in low-resource northern Indian languages. We set up multiple experimental splits and train and analyze two competitive ASR models to serve as baselines for future research using this data.

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Diverse Musicological Studies