RASMALAI: Resources for Adaptive Speech Modeling in Indian Languages with Accents and Intonations

Ashwin Sankar; Yoach Lacombe; Sherry Thomas; Praveen Srinivasa Varadhan; Sanchit Gandhi; Mitesh M Khapra

arXiv:2505.18609 · cs.CL · May 28, 2025

TL;DR

RASMALAI is a 13,000-hour speech dataset with 24 million text-description annotations covering 23 Indian languages and English, built to support controllable, expressive text-to-speech synthesis.

Contribution

We created RASMALAI, a large-scale, richly annotated speech dataset, and developed IndicParlerTTS, the first open-source TTS system guided by text descriptions for Indian languages.

Findings

01. IndicParlerTTS generates high-quality speech, including for named speakers.

02. The system reliably follows text descriptions and accurately synthesizes the specified attributes.

03. It transfers expressive characteristics both within and across languages.

Abstract

We introduce RASMALAI, a large-scale speech dataset with rich text descriptions, designed to advance controllable and expressive text-to-speech (TTS) synthesis for 23 Indian languages and English. It comprises 13,000 hours of speech and 24 million text-description annotations with fine-grained attributes like speaker identity, accent, emotion, style, and background conditions. Using RASMALAI, we develop IndicParlerTTS, the first open-source, text-description-guided TTS for Indian languages. Systematic evaluation demonstrates its ability to generate high-quality speech for named speakers, reliably follow text descriptions and accurately synthesize specified attributes. Additionally, it effectively transfers expressive characteristics both within and across languages. IndicParlerTTS consistently achieves strong performance across these evaluations, setting a new standard for controllable…
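
To make the idea of a text description built from fine-grained attributes more concrete, here is a minimal Python sketch. It is not the dataset's actual schema or the authors' annotation pipeline: the `UtteranceAnnotation` record, its field names, and the `build_description` helper are all hypothetical and chosen only to mirror the attribute types the abstract lists (speaker identity, accent, emotion, style, background conditions).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UtteranceAnnotation:
    """Hypothetical record; field names are illustrative, not the dataset's schema."""
    language: str
    speaker: Optional[str] = None      # named speaker, if any
    accent: Optional[str] = None
    emotion: Optional[str] = None
    style: Optional[str] = None
    background: Optional[str] = None   # background condition, e.g. "light noise"


def build_description(ann: UtteranceAnnotation) -> str:
    """Join whichever attributes are present into one free-text description."""
    subject = ann.speaker if ann.speaker else "A speaker"
    parts = [f"{subject} speaks in {ann.language}"]
    if ann.accent:
        parts.append(f"with a {ann.accent} accent")
    if ann.emotion:
        parts.append(f"in a {ann.emotion} tone")
    if ann.style:
        parts.append(f"in a {ann.style} style")
    if ann.background:
        parts.append(f"with {ann.background} in the background")
    return " ".join(parts) + "."


if __name__ == "__main__":
    example = UtteranceAnnotation(language="Hindi", emotion="happy",
                                  background="light noise")
    print(build_description(example))
```

A description-guided TTS model would take a string like this alongside the transcript to be spoken; how IndicParlerTTS actually consumes descriptions is not specified in the text above, so that step is left out of the sketch.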

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing