Towards Building Large Scale Datasets and State-of-the-Art Automatic Speech Translation Systems for 14 Indian Languages

Ashwin Sankar; Sparsh Jain; Nikhil Narasimhan; Devilal Choudhary; Dhairya Suman; Mohammed Safi Ur Rahman Khan; Anoop Kunchukuttan; Mitesh M Khapra; Raj Dabre

arXiv:2411.04699 · cs.CL · June 3, 2025

TL;DR

This paper introduces BhasaAnuvaad, the largest speech translation dataset for Indian languages, and develops IndicSeamless, a state-of-the-art model that significantly improves translation quality across 14 languages.

Contribution

The paper presents a large-scale, diverse speech translation dataset for Indian languages and a new model that advances the state of the art in speech translation.

Findings

01 BhasaAnuvaad contains over 44,000 hours of audio and 17 million aligned text segments.

02 IndicSeamless outperforms existing models on Indian language speech translation tasks.

03 The data, code, and models are released as open source to foster further research.

Abstract

Speech translation for Indian languages remains a challenging task due to the scarcity of large-scale, publicly available datasets that capture the linguistic diversity and domain coverage essential for real-world applications. Existing datasets cover a fraction of Indian languages and lack the breadth needed to train robust models that generalize beyond curated benchmarks. To bridge this gap, we introduce BhasaAnuvaad, the largest speech translation dataset for Indian languages, spanning over 44 thousand hours of audio and 17 million aligned text segments across 14 Indian languages and English. Our dataset is built through a threefold methodology: (a) aggregating high-quality existing sources, (b) large-scale web crawling to ensure linguistic and domain diversity, and (c) creating synthetic data to model real-world speech disfluencies. Leveraging BhasaAnuvaad, we train IndicSeamless, a…

Code & Models

Repositories

ai4bharat/bhasaanuvaad
PyTorch · Official

Models

ai4bharat/indic-seamless
model · 3.6k downloads · 18 likes

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Music and Audio Processing