VAANI: Capturing the language landscape for an inclusive digital India
Sujith Pulikodan, Abhayjeet Singh, Agneedh Basu, Nihar Desai, Pavan Kumar J, Pranav D Bhat, Raghu Dharmaraju, Ritika Gupta, Sathvik Udupa, Saurabh Kumar, Sumit Sharma, Vaibhav Vishwakarma, Visruth Sanka, Dinesh Tewari, Harsh Dhand, Amrita Kamat, Sukhwinder Singh

TL;DR
Project VAANI has created a multi-modal dataset capturing India's linguistic diversity, with approximately 31,270 hours of speech and around 289K images across 112 languages from 165 districts, to support inclusive AI development.
Contribution
The paper introduces a large-scale, multi-modal dataset representing India's diverse languages and dialects, validated through multi-stage automated and manual quality checks and released as open-source data for inclusive speech and multimodal AI research.
Findings
Collected approximately 31,270 hours of speech data across 112 languages from 165 districts.
Open-sourced around 289K images and around 2,067 hours of transcribed speech.
First large-scale dataset representing many Indian languages.
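To put the headline figures in proportion, the short sketch below derives two ratios from the numbers quoted above: the share of audio that is transcribed and the mean audio hours per language. The variable names are illustrative, and the per-language mean is only an average; the summary does not say how hours are actually distributed across languages.

```python
# Figures as quoted in the summary above (all approximate in the source).
TOTAL_AUDIO_HOURS = 31_270
TRANSCRIBED_HOURS = 2_067
NUM_LANGUAGES = 112

# Share of the collected audio that comes with transcriptions, in percent.
transcribed_share_pct = round(100 * TRANSCRIBED_HOURS / TOTAL_AUDIO_HOURS, 2)

# Mean audio hours per language (a plain average, not the real distribution).
mean_hours_per_language = round(TOTAL_AUDIO_HOURS / NUM_LANGUAGES, 1)

print(f"Transcribed share: {transcribed_share_pct}%")
print(f"Mean hours per language: {mean_hours_per_language}")
```

Roughly one hour in fifteen is transcribed, so most of the released audio is untranscribed speech.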
Abstract
Project VAANI is an initiative to create an India-representative multi-modal dataset that comprehensively maps India's linguistic diversity, starting with 165 districts across the country in its first two phases. Speech data is collected through a carefully structured process that uses image-based prompts to encourage spontaneous responses. Images are captured through a separate process that encompasses a broad range of topics, gathered from both within and across districts. The collected data undergoes a rigorous multi-stage quality evaluation, including both automated and manual checks, to ensure the highest possible standards in audio quality and transcription accuracy. Following this thorough validation, we have open-sourced around 289K images, approximately 31,270 hours of audio recordings, and around 2,067 hours of transcribed speech, encompassing 112 languages from 165 districts from…
Code & Models
- 🤗 ARTPARK-IISc/Vaani-FastConformer-Multilingual (model) · 922 dl · ♡ 7
- 🤗 ARTPARK-IISc/whisper-large-v3-vaani-hindi (model) · 2.8k dl · ♡ 11
- 🤗 ARTPARK-IISc/whisper-large-v3-vaani-telugu (model) · 69 dl
- 🤗 ARTPARK-IISc/whisper-large-v3-vaani-odia (model) · 26 dl
- 🤗 ARTPARK-IISc/Vaani-FastConformer-Hindi (model) · 79 dl
- 🤗 ARTPARK-IISc/Vaani-FastConformer-Kannada (model)
- 🤗 ARTPARK-IISc/Vaani-FastConformer-Telugu (model) · 4 dl
- 🤗 ARTPARK-IISc/Vaani-FastConformer-Odia (model) · 5 dl · ♡ 1
- 🤗 ARTPARK-IISc/Vaani-LID_v0 (model) · 215 dl · ♡ 1
- 🤗 ARTPARK-IISc/Vaani-FastConformer-Malayalam (model)
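As a rough illustration of how one of the Whisper checkpoints listed above might be tried out, the sketch below wraps the Hugging Face `transformers` speech-recognition pipeline. It assumes that `ARTPARK-IISc/whisper-large-v3-vaani-hindi` is published in a format the standard pipeline can load, which this page does not confirm; the helper names `build_transcriber` and `transcribe` are hypothetical, and the FastConformer models would likely need a different toolkit.

```python
# Minimal sketch, not the authors' documented usage.
# Assumption: the checkpoint loads through the standard transformers ASR pipeline.
MODEL_ID = "ARTPARK-IISc/whisper-large-v3-vaani-hindi"


def build_transcriber(model_id: str = MODEL_ID):
    """Create a speech-recognition pipeline for the given Hub model ID."""
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import pipeline

    return pipeline("automatic-speech-recognition", model=model_id)


def transcribe(audio_path: str, model_id: str = MODEL_ID) -> str:
    """Transcribe one short audio file and return the recognized text."""
    asr = build_transcriber(model_id)
    result = asr(audio_path)  # the ASR pipeline returns a dict with a "text" key
    return result["text"]
```

Calling `transcribe("sample.wav")` downloads the model weights on first use, so it needs network access and several gigabytes of disk space; `sample.wav` is a placeholder path.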
