A2TTS: TTS for Low Resource Indian Languages

Ayush Singh Bhadoriya; Abhishek Nikunj Shinde; Isha Pandey; Ganesh Ramakrishnan

arXiv:2507.15272·cs.SD·July 22, 2025


TL;DR

This paper introduces A2TTS, a diffusion-based, speaker-conditioned TTS system for low-resource Indian languages. It supports zero-shot speaker adaptation and conditions on reference audio to improve prosody.

Contribution

The paper presents a novel diffusion-based TTS architecture with cross-attention duration modeling and classifier-free guidance for zero-shot speaker adaptation in Indian languages.

Findings

01 Effective multispeaker TTS for Indian languages

02 Improved naturalness and speaker similarity

03 Zero-shot speaker generation capability

Abstract

We present a speaker-conditioned text-to-speech (TTS) system aimed at addressing challenges in generating speech for unseen speakers and supporting diverse Indian languages. Our method leverages a diffusion-based TTS architecture, where a speaker encoder extracts embeddings from short reference audio samples to condition the DDPM decoder for multispeaker generation. To further enhance prosody and naturalness, we employ a cross-attention-based duration prediction mechanism that utilizes reference audio, enabling more accurate and speaker-consistent timing. This results in speech that closely resembles the target speaker while improving duration modeling and overall expressiveness. Additionally, to improve zero-shot generation, we employ classifier-free guidance, allowing the system to generate speech that more closely matches unseen speakers. Using this approach, we trained…
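The classifier-free guidance step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_denoiser`, `guided_noise`, the array shapes, and the guidance formula `eps_uncond + w * (eps_cond - eps_uncond)` (one common formulation) are all assumptions made for illustration. The idea is that the decoder predicts noise twice, once with the speaker embedding and once with the conditioning dropped, and the two predictions are extrapolated toward the conditioned one.

```python
import numpy as np


def toy_denoiser(x_t, speaker_emb):
    """Hypothetical stand-in for a DDPM decoder's noise predictor.

    x_t: noisy mel-spectrogram, shape (frames, mel_bins).
    speaker_emb: 1-D speaker embedding, or None for the unconditional branch.
    """
    base = 0.1 * x_t
    if speaker_emb is None:
        # Unconditional branch: no speaker information is used.
        return base
    # Conditional branch: a toy shift that depends on the speaker embedding.
    return base + 0.01 * speaker_emb.mean()


def guided_noise(x_t, speaker_emb, guidance_scale):
    """Combine conditional and unconditional noise predictions.

    guidance_scale = 0 gives the unconditional prediction,
    guidance_scale = 1 gives the conditional prediction,
    guidance_scale > 1 pushes further toward the speaker condition.
    """
    eps_cond = toy_denoiser(x_t, speaker_emb)
    eps_uncond = toy_denoiser(x_t, None)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


# Toy inputs (assumed shapes): 4 frames x 8 mel bins, 16-dim speaker embedding.
x_t = np.ones((4, 8))
speaker_emb = np.full(16, 2.0)
eps = guided_noise(x_t, speaker_emb, guidance_scale=2.0)
```

In a typical classifier-free guidance setup, the unconditional branch is made possible by randomly dropping the conditioning signal during training; whether A2TTS does this, and with what drop rate or guidance scale, is not stated in the text above.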
