IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages

Tahir Javed; Janki Atul Nawale; Eldho Ittan George; Sakshi Joshi; Kaushal Santosh Bhogale; Deovrat Mehendale; Ishvinder Virender Sethi; Aparna Ananthanarayanan; Hafsah Faquih; Pratiti Palit; Sneha Ravishankar; Saranya Sukumaran; Tripura Panchagnula; Sunjay Murali; Kunal Sharad Gandhi; Ambujavalli R; Manickam K M; C Venkata Vaijayanthi; Krishnan Srinivasa Raghavan Karunganni; Pratyush Kumar; Mitesh M Khapra

arXiv:2403.01926·cs.CL·March 5, 2024·1 cites


Open Access · 1 dataset

TL;DR

This paper introduces INDICVOICES, a large, diverse multilingual speech dataset from India, along with an open-source blueprint for data collection, enabling the development of inclusive speech recognition models for 22 Indian languages.

Contribution

The paper provides a comprehensive, scalable, open-source blueprint for data collection, along with IndicASR, described as the first multilingual ASR model to support all 22 languages covered by the dataset.

Findings

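01. Collected 7348 hours of read, extempore and conversational speech from 16237 speakers across 145 Indian districts.

02. Developed INDICVOICES, a multilingual speech dataset covering 22 Indian languages, with 1639 hours transcribed so far.

03. Built IndicASR, described as the first ASR model to support all 22 of these languages.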

Abstract

We present INDICVOICES, a dataset of natural and spontaneous speech containing a total of 7348 hours of read (9%), extempore (74%) and conversational (17%) audio from 16237 speakers covering 145 Indian districts and 22 languages. Of these 7348 hours, 1639 hours have already been transcribed, with a median of 73 hours per language. Through this paper, we share our journey of capturing the cultural, linguistic and demographic diversity of India to create a one-of-its-kind inclusive and representative dataset. More specifically, we share an open-source blueprint for data collection at scale comprising standardised protocols, centralised tools, a repository of engaging questions, prompts and conversation scenarios spanning multiple domains and topics of interest, quality control mechanisms, comprehensive transcription guidelines and transcription tools. We hope that this open source…
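The headline figures in the abstract can be turned into approximate per-style hours with a little arithmetic. The sketch below is illustrative only: the constants are copied from the abstract, the function and variable names are hypothetical, and because the published percentages are rounded, the resulting hour counts are estimates rather than figures reported by the authors.

```python
# Figures quoted in the abstract.
TOTAL_HOURS = 7348
TRANSCRIBED_HOURS = 1639
SPEAKERS = 16237
# Rounded shares from the abstract; derived hours are therefore approximate.
SHARES = {"read": 0.09, "extempore": 0.74, "conversational": 0.17}


def hours_by_style(total_hours, shares):
    """Split a total number of hours according to fractional shares."""
    return {style: round(total_hours * share, 1) for style, share in shares.items()}


# Approximate hours of audio per speaking style.
breakdown = hours_by_style(TOTAL_HOURS, SHARES)

# Share of the collected audio that has been transcribed so far, in percent.
transcribed_pct = round(100 * TRANSCRIBED_HOURS / TOTAL_HOURS, 1)

# Mean audio per speaker, in minutes (a simple average, not a reported statistic).
avg_minutes_per_speaker = round(60 * TOTAL_HOURS / SPEAKERS, 1)

if __name__ == "__main__":
    for style, hours in breakdown.items():
        print(f"{style}: ~{hours} hours")
    print(f"transcribed: ~{transcribed_pct}% of total")
    print(f"mean audio per speaker: ~{avg_minutes_per_speaker} minutes")
```

Under these assumptions, extempore speech accounts for roughly 5400 hours, and a little over a fifth of the collected audio has been transcribed.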


Code & Models

Datasets

ai4bharat/IndicVoices
dataset · 9.4k downloads


Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques