Technical report: Impact of Duration Prediction on Speaker-specific TTS for Indian Languages
Isha Pandey, Pranav Gaikwad, Amruta Parulekar, Ganesh Ramakrishnan

TL;DR
This paper investigates how duration prediction affects speaker-specific text-to-speech synthesis in Indian languages, highlighting its importance for intelligibility and speaker consistency in low-resource multilingual settings.
Contribution
It provides a comparative analysis of duration prediction strategies in a non-autoregressive CNF-based speech model for Indian languages, emphasizing their impact on speech quality and speaker fidelity.
Findings
Duration predictors improve intelligibility in some languages.
Speaker-prompted predictors better preserve speaker identity.
Trade-offs exist between intelligibility and speaker consistency.
Abstract
High-quality speech generation for low-resource languages, such as many Indian languages, remains a significant challenge due to limited data and diverse linguistic structures. Duration prediction is a critical component in many speech generation pipelines, playing a key role in modeling prosody and speech rhythm. While some recent generative approaches omit explicit duration modeling, often at the cost of longer training times, we retain and explore this module to better understand its impact in the linguistically rich and data-scarce landscape of India. We train a non-autoregressive Continuous Normalizing Flow (CNF)-based speech model using publicly available Indian language data and evaluate multiple duration prediction strategies for zero-shot, speaker-specific generation. Our comparative analysis on speech-infilling tasks reveals nuanced trade-offs: infilling based…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing
