Rasa: Building Expressive Speech Synthesis Systems for Indian Languages in Low-resource Settings

Praveen Srinivasa Varadhan; Ashwin Sankar; Giri Raju; Mitesh M. Khapra

arXiv:2407.14056 · cs.CL · September 4, 2024

TL;DR

This paper introduces Rasa, a multilingual expressive TTS dataset for Indian languages, demonstrating effective resource-efficient methods for expressive speech synthesis in low-resource settings.

Contribution

It provides the first multilingual expressive TTS dataset for Indian languages and offers practical insights into data requirements for high-quality expressive speech synthesis.

Findings

01

1 hour of neutral data plus 30 minutes of expressive data suffices for a Fair system, as measured by MUSHRA scores

02

Increasing neutral data to 10 hours, even with minimal expressive data, significantly improves expressiveness

03

Pooling emotions enhances expressiveness

Abstract

We release Rasa, the first multilingual expressive TTS dataset for any Indian language, which contains 10 hours of neutral speech and 1-3 hours of expressive speech for each of the 6 Ekman emotions covering 3 languages: Assamese, Bengali, & Tamil. Our ablation studies reveal that just 1 hour of neutral and 30 minutes of expressive data can yield a Fair system as indicated by MUSHRA scores. Increasing neutral data to 10 hours, with minimal expressive data, significantly enhances expressiveness. This offers a practical recipe for resource-constrained languages, prioritizing easily obtainable neutral data alongside smaller amounts of expressive data. We show the importance of syllabically balanced data and pooling emotions to enhance expressiveness. We also highlight challenges in generating specific emotions, e.g., fear and surprise.
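
To make the "Fair" rating in the abstract concrete, here is a minimal sketch of how a MUSHRA score maps to a quality label. It assumes the standard labelled bands of the MUSHRA scale (ITU-R BS.1534): 0-20 Bad, 20-40 Poor, 40-60 Fair, 60-80 Good, 80-100 Excellent. The function name `mushra_band` is hypothetical, and assigning boundary values to the higher band is an assumption made here, not something the abstract specifies.

```python
def mushra_band(score: float) -> str:
    """Map a MUSHRA score in [0, 100] to its quality label."""
    if not 0 <= score <= 100:
        raise ValueError("MUSHRA scores lie in [0, 100]")
    # Lower bound of each band, checked from best to worst.
    # Assumption: a score exactly on a boundary (e.g. 40) goes to the higher band.
    bands = [(80, "Excellent"), (60, "Good"), (40, "Fair"), (20, "Poor"), (0, "Bad")]
    for lower, label in bands:
        if score >= lower:
            return label
    return "Bad"  # Unreachable for valid input; kept for completeness.


if __name__ == "__main__":
    # Illustrative values only, not results from the paper.
    for s in (15, 45, 72):
        print(s, mushra_band(s))
```

Under this reading, a system rated Fair scores roughly in the 40-60 range.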

Code & Models

Repositories

AI4Bharat/Rasa
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research