A2TTS: TTS for Low Resource Indian Languages
Ayush Singh Bhadoriya, Abhishek Nikunj Shinde, Isha Pandey, Ganesh Ramakrishnan

TL;DR
This paper introduces A2TTS, a diffusion-based, speaker-conditioned TTS system designed for low-resource Indian languages, capable of zero-shot speaker adaptation and improved prosody through reference audio conditioning.
Contribution
The paper presents a novel diffusion-based TTS architecture with cross-attention duration modeling and classifier-free guidance for zero-shot speaker adaptation in Indian languages.
Findings
Effective multispeaker TTS for Indian languages
Improved naturalness and speaker similarity
Zero-shot speaker generation capability
Abstract
We present a speaker-conditioned text-to-speech (TTS) system aimed at addressing challenges in generating speech for unseen speakers and supporting diverse Indian languages. Our method leverages a diffusion-based TTS architecture, where a speaker encoder extracts embeddings from short reference audio samples to condition the DDPM decoder for multispeaker generation. To further enhance prosody and naturalness, we employ a cross-attention-based duration prediction mechanism that utilizes reference audio, enabling more accurate and speaker-consistent timing. This results in speech that closely resembles the target speaker while improving duration modeling and overall expressiveness. Additionally, to improve zero-shot generation, we employ classifier-free guidance, allowing the system to generate more natural speech for unseen speakers. Using this approach, we trained…
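The classifier-free guidance step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_denoiser`, `drop_condition`, `cfg_combine`, the all-zeros null embedding, the array sizes, and the guidance scale are all hypothetical choices made for illustration. The sketch assumes the common formulation in which the guided noise estimate is `eps_uncond + w * (eps_cond - eps_uncond)`, so `w = 1` recovers the purely conditional prediction.

```python
import numpy as np

def drop_condition(speaker_emb, null_emb, p_drop, rng):
    """Training-time condition dropout (assumed setup): with probability
    p_drop, replace the speaker embedding with a null embedding so the
    same network also learns an unconditional prediction."""
    return null_emb if rng.random() < p_drop else speaker_emb

def cfg_combine(eps_cond, eps_uncond, guidance_scale):
    """Classifier-free guidance: start from the unconditional estimate and
    move guidance_scale times the distance toward the conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def toy_denoiser(x_t, speaker_emb):
    """Hypothetical stand-in for a noise-prediction network; a real DDPM
    decoder would be a neural network that also takes text and timestep."""
    return 0.1 * x_t + speaker_emb.mean()

rng = np.random.default_rng(0)
x_t = rng.standard_normal((80, 4))     # toy noisy mel-spectrogram: 80 bins x 4 frames
speaker_emb = rng.standard_normal(16)  # toy embedding from a reference clip
null_emb = np.zeros(16)                # assumed "no speaker" embedding

# Training time: the network sometimes sees the null embedding instead.
train_emb = drop_condition(speaker_emb, null_emb, p_drop=0.1, rng=rng)

# Sampling time: two forward passes per diffusion step, then one blend.
eps_cond = toy_denoiser(x_t, speaker_emb)
eps_uncond = toy_denoiser(x_t, null_emb)
eps_guided = cfg_combine(eps_cond, eps_uncond, guidance_scale=2.0)
```

With a guidance scale above 1, the estimate is pushed past the conditional prediction and away from the unconditional one, which is the usual way to trade sample diversity for closer adherence to the conditioning signal, here the reference speaker.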
