Technical report: Impact of Duration Prediction on Speaker-specific TTS for Indian Languages

Isha Pandey; Pranav Gaikwad; Amruta Parulekar; Ganesh Ramakrishnan

arXiv:2507.16875 · eess.AS · July 24, 2025

TL;DR

This paper investigates how duration prediction affects speaker-specific text-to-speech synthesis in Indian languages, highlighting its importance for intelligibility and speaker consistency in low-resource multilingual settings.

Contribution

It provides a comparative analysis of duration prediction strategies in a non-autoregressive CNF-based speech model for Indian languages, emphasizing their impact on speech quality and speaker fidelity.

Findings

01. Duration predictors improve intelligibility in some languages.

02. Speaker-prompted predictors better preserve speaker identity.

03. Trade-offs exist between intelligibility and speaker consistency.

Abstract

High-quality speech generation for low-resource languages, such as many Indian languages, remains a significant challenge due to limited data and diverse linguistic structures. Duration prediction is a critical component in many speech generation pipelines, playing a key role in modeling prosody and speech rhythm. While some recent generative approaches omit explicit duration modeling, often at the cost of longer training times, we retain and explore this module to better understand its impact in the linguistically rich and data-scarce landscape of India. We train a non-autoregressive Continuous Normalizing Flow (CNF) based speech model using publicly available Indian language data and evaluate multiple duration prediction strategies for zero-shot, speaker-specific generation. Our comparative analysis on speech-infilling tasks reveals nuanced trade-offs: infilling based…

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing