IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages
Tahir Javed, Janki Atul Nawale, Eldho Ittan George, Sakshi Joshi, Kaushal Santosh Bhogale, Deovrat Mehendale, Ishvinder Virender Sethi, Aparna Ananthanarayanan, Hafsah Faquih, Pratiti Palit, Sneha Ravishankar, Saranya Sukumaran, Tripura Panchagnula, Sunjay Murali

TL;DR
This paper introduces INDICVOICES, a large, diverse multilingual speech dataset from India, along with an open-source blueprint for data collection, enabling the development of inclusive speech recognition models for 22 Indian languages.
Contribution
It provides a comprehensive, scalable data collection framework and IndicASR, presented as the first multilingual ASR model to support all 22 languages covered by the dataset.
Findings
Collected 7348 hours of speech from 16237 speakers across 145 Indian districts.
Developed INDICVOICES, a multilingual speech dataset for Indian languages.
Built IndicASR, presented as the first ASR model to support all 22 languages covered by the dataset.
Abstract
We present INDICVOICES, a dataset of natural and spontaneous speech containing a total of 7348 hours of read (9%), extempore (74%) and conversational (17%) audio from 16237 speakers covering 145 Indian districts and 22 languages. Of these 7348 hours, 1639 hours have already been transcribed, with a median of 73 hours per language. Through this paper, we share our journey of capturing the cultural, linguistic and demographic diversity of India to create a one-of-its-kind inclusive and representative dataset. More specifically, we share an open-source blueprint for data collection at scale comprising standardised protocols, centralised tools, a repository of engaging questions, prompts and conversation scenarios spanning multiple domains and topics of interest, quality control mechanisms, comprehensive transcription guidelines and transcription tools. We hope that this open source…
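The headline figures in the abstract can be turned into approximate per-category hours with a little arithmetic. The sketch below is illustrative only: the constant and function names are hypothetical, and because the percentage shares are rounded in the abstract, the derived hours are estimates rather than figures reported by the authors.

```python
# Figures quoted in the abstract; all names below are illustrative.
TOTAL_HOURS = 7348
TRANSCRIBED_HOURS = 1639
SPEAKERS = 16237
# Rounded percentage shares, so derived hours are approximate.
SHARES = {"read": 0.09, "extempore": 0.74, "conversational": 0.17}


def hours_by_type(total_hours, shares):
    """Approximate hours per speech type from rounded percentage shares."""
    return {name: total_hours * share for name, share in shares.items()}


def transcribed_fraction(transcribed, total):
    """Fraction of the collected audio that has been transcribed."""
    return transcribed / total


breakdown = hours_by_type(TOTAL_HOURS, SHARES)
for name, hours in breakdown.items():
    print(f"{name}: ~{hours:.0f} h")
print(f"transcribed: {transcribed_fraction(TRANSCRIBED_HOURS, TOTAL_HOURS):.1%}")
print(f"mean audio per speaker: ~{TOTAL_HOURS / SPEAKERS * 60:.0f} min")
```

Under these assumptions, extempore speech accounts for roughly three quarters of the audio, and a little over a fifth of the collected hours have been transcribed so far.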
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques
