VAANI: Capturing the language landscape for an inclusive digital India

Sujith Pulikodan; Abhayjeet Singh; Agneedh Basu; Nihar Desai; Pavan Kumar J; Pranav D Bhat; Raghu Dharmaraju; Ritika Gupta; Sathvik Udupa; Saurabh Kumar; Sumit Sharma; Vaibhav Vishwakarma; Visruth Sanka; Dinesh Tewari; Harsh Dhand; Amrita Kamat; Sukhwinder Singh; Shikhar Vashishth; Partha Talukdar; Raj Acharya; and Prasanta Kumar Ghosh

arXiv:2603.28714 · eess.AS · April 1, 2026

TL;DR

Project VAANI has released an open, multi-modal dataset that captures India's linguistic diversity: approximately 31,270 hours of speech (around 2,067 hours of it transcribed) and around 289K images, covering 112 languages across 165 districts, to support inclusive AI development.

Contribution

The paper introduces a large-scale, multi-modal dataset of speech and images representing India's diverse languages and dialects. The data passes a multi-stage quality evaluation with both automated and manual checks, and is open-sourced for inclusive speech and multimodal AI research.

Findings

01

Collected and open-sourced approximately 31,270 hours of speech audio spanning 112 languages from 165 districts.

02

Open-sourced around 289K images and around 2,067 hours of transcribed speech.

03

First large-scale dataset representing many Indian languages.

Abstract

Project VAANI is an initiative to create an India-representative multi-modal dataset that comprehensively maps India's linguistic diversity, starting with 165 districts across the country in its first two phases. Speech data is collected through a carefully structured process that uses image-based prompts to encourage spontaneous responses. Images are captured through a separate process that encompasses a broad range of topics, gathered from both within and across districts. The collected data undergoes a rigorous multi-stage quality evaluation, including both automated and manual checks, to ensure the highest possible standards in audio quality and transcription accuracy. Following this thorough validation, we have open-sourced around 289K images, approximately 31,270 hours of audio recordings, and around 2,067 hours of transcribed speech, encompassing 112 languages from 165 districts from…
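To put the headline figures in proportion, the following minimal Python sketch derives a few ratios from the counts quoted in the abstract. The function name `summarize` and the dictionary keys are illustrative and not part of any VAANI tooling. The inputs are the abstract's rounded, approximate figures, and the per-district and per-language values are plain averages that say nothing about how the data is actually distributed.

```python
# Headline figures as quoted in the abstract; all are approximate ("around").
AUDIO_HOURS = 31_270        # open-sourced audio recordings, in hours
TRANSCRIBED_HOURS = 2_067   # transcribed speech, in hours
IMAGES = 289_000            # "around 289K" images
DISTRICTS = 165
LANGUAGES = 112


def summarize(audio_hours, transcribed_hours, images, districts, languages):
    """Return simple ratios derived from the headline counts (hypothetical helper)."""
    return {
        # Fraction of the released audio that also has a transcript.
        "transcribed_share": transcribed_hours / audio_hours,
        # Plain averages; the real per-district and per-language spread is not given.
        "audio_hours_per_district": audio_hours / districts,
        "audio_hours_per_language": audio_hours / languages,
        "images_per_district": images / districts,
    }


stats = summarize(AUDIO_HOURS, TRANSCRIBED_HOURS, IMAGES, DISTRICTS, LANGUAGES)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

On these figures, roughly 6.6% of the released audio is transcribed, and the average works out to about 190 hours of audio per district.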
