RASMALAI: Resources for Adaptive Speech Modeling in Indian Languages with Accents and Intonations
Ashwin Sankar, Yoach Lacombe, Sherry Thomas, Praveen Srinivasa Varadhan, Sanchit Gandhi, Mitesh M Khapra

TL;DR
RASMALAI is a comprehensive speech dataset for 23 Indian languages and English, enabling advanced controllable and expressive text-to-speech synthesis with rich attribute control.
Contribution
We created RASMALAI, a large-scale, richly annotated speech dataset, and developed IndicParlerTTS, the first open-source TTS system guided by text descriptions for Indian languages.
Findings
IndicParlerTTS produces high-quality, controllable speech synthesis.
The system reliably follows text descriptions and attributes.
It effectively transfers expressive features across languages.
Abstract
We introduce RASMALAI, a large-scale speech dataset with rich text descriptions, designed to advance controllable and expressive text-to-speech (TTS) synthesis for 23 Indian languages and English. It comprises 13,000 hours of speech and 24 million text-description annotations with fine-grained attributes like speaker identity, accent, emotion, style, and background conditions. Using RASMALAI, we develop IndicParlerTTS, the first open-source, text-description-guided TTS for Indian languages. Systematic evaluation demonstrates its ability to generate high-quality speech for named speakers, reliably follow text descriptions and accurately synthesize specified attributes. Additionally, it effectively transfers expressive characteristics both within and across languages. IndicParlerTTS consistently achieves strong performance across these evaluations, setting a new standard for controllable…
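To make the idea of a text-description annotation concrete, the following minimal Python sketch composes a free-text description from the attribute types the abstract lists (speaker identity, accent, emotion, style, and background conditions). The `SpeechAttributes` class, its field names, the `build_description` helper, and the sentence template are all hypothetical and chosen for illustration only; they are not the dataset's actual schema or the authors' annotation pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeechAttributes:
    """Hypothetical attribute record; field names are illustrative only."""
    speaker: Optional[str] = None
    accent: Optional[str] = None
    emotion: Optional[str] = None
    style: Optional[str] = None
    background: Optional[str] = None


def build_description(attrs: SpeechAttributes) -> str:
    """Compose a free-text description from whichever attributes are set."""
    subject = attrs.speaker or "A speaker"
    parts = []
    if attrs.accent:
        parts.append(f"with a {attrs.accent} accent")
    if attrs.emotion:
        parts.append(f"in a {attrs.emotion} tone")
    if attrs.style:
        parts.append(f"in a {attrs.style} style")
    sentence = " ".join([subject, "speaks"] + parts) + "."
    if attrs.background:
        sentence += f" The recording has {attrs.background}."
    return sentence


# Assumed example values, not taken from the dataset.
example = SpeechAttributes(
    speaker="Speaker A",
    accent="Tamil",
    emotion="happy",
    background="very little background noise",
)
description = build_description(example)
print(description)
```

In a description-guided TTS setup, a string like this would be supplied alongside the text to be spoken, so that the model conditions on the description when generating audio. Unset attributes are simply omitted from the description.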
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
