Towards Developing State-of-the-Art TTS Synthesisers for 13 Indian Languages with Signal Processing aided Alignments
Anusha Prakash, S Umesh, Hema A Murthy

TL;DR
This paper combines signal processing cues with forced alignment to obtain accurate phone boundaries for training data. The resulting improvement in duration modelling yields high-quality end-to-end TTS synthesisers for 13 Indian languages, with particular benefit in low-resource settings.
Contribution
It introduces a signal processing-aided alignment method that improves duration modelling and synthesis quality in TTS systems for 13 Indian languages, outperforming systems built with other alignment approaches.
Findings
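Systems built with the proposed signal processing-aided alignments outperform systems built with other alignment approaches, especially in low-resource scenarios.
More accurate phone boundaries lead to better duration modelling, which in turn yields good-quality synthesisers.
The resulting systems outperform the existing best TTS systems available for 13 Indian languages.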
Abstract
End-to-end (E2E) systems synthesise high-quality speech, but this typically requires a large amount of data. As E2E synthesis progressed from Tacotron to FastSpeech2, it became evident that features representing prosody, particularly sub-word durations, are important for error-free synthesis. Variants of FastSpeech use a teacher model or forced alignments for training. This paper uses signal processing cues in tandem with forced alignment to produce accurate phone boundaries for the training data. As a result of better duration modelling, good-quality synthesisers are developed. Evaluations indicate that systems developed using the proposed signal processing-aided approach are better than systems developed using other alignment approaches, especially in low-resource scenarios. Our systems also outperform the existing best TTS systems available for 13 Indian languages.
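The idea of refining forced-alignment boundaries with signal cues, then deriving phone durations for a FastSpeech2-style model, can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say which cues are used or how they are combined with the forced alignment, so the nearest-cue snapping rule, the 20 ms tolerance, the 10 ms hop, and the function names `snap_boundaries` and `frame_durations` are all hypothetical and not the authors' method.

```python
from bisect import bisect_left


def snap_boundaries(fa_bounds, cues, tol=0.02):
    """Move each forced-alignment boundary (seconds) to the nearest
    signal-derived cue if one lies within `tol` seconds; otherwise keep it.
    Assumption: `cues` are candidate boundary times from some signal analysis.
    """
    cues = sorted(cues)
    snapped = []
    for b in fa_bounds:
        i = bisect_left(cues, b)
        # Only the cues immediately left and right of b can be the nearest.
        candidates = cues[max(i - 1, 0):i + 1]
        nearest = min(candidates, key=lambda c: abs(c - b), default=b)
        snapped.append(nearest if abs(nearest - b) <= tol else b)
    return snapped


def frame_durations(bounds, hop_s=0.01):
    """Convert boundary times (utterance start, internal boundaries, end)
    into per-phone durations in frames, assuming a hop of `hop_s` seconds.
    """
    frames = [round(t / hop_s) for t in bounds]
    return [end - start for start, end in zip(frames, frames[1:])]


# Hypothetical three-phone utterance with one nearby cue.
refined = snap_boundaries([0.0, 0.10, 0.25, 0.40], cues=[0.11, 0.30])
durations = frame_durations(refined)
```

With a small tolerance, two adjacent boundaries are unlikely to snap to the same cue, but a real pipeline would need to guard against that case so that no phone ends up with zero duration.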
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems
Methods: Sigmoid Activation · Max Pooling · Highway Layer · Highway Network · Residual Connection · Tanh Activation · Convolution · Batch Normalization · Bidirectional GRU
