Towards Developing State-of-the-Art TTS Synthesisers for 13 Indian Languages with Signal Processing aided Alignments

Anusha Prakash; S Umesh; Hema A Murthy

arXiv:2210.17153 · eess.AS · September 19, 2024 · 1 citation

TL;DR

This paper combines signal processing cues with forced alignment to obtain accurate phone boundaries, which improves duration modelling in end-to-end TTS systems. The resulting synthesisers for 13 Indian languages produce high-quality speech, with the largest gains in low-resource settings.

Contribution

It introduces a signal processing-aided alignment method that enhances duration accuracy and synthesis quality in multilingual TTS systems, outperforming existing approaches.

Findings

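01

Systems trained with the proposed signal processing-aided alignments outperform systems trained with other alignment approaches, especially in low-resource scenarios.

02

More accurate phone boundaries lead to better duration modelling and, in turn, good-quality synthesis.

03

The resulting systems outperform the existing best TTS systems available for 13 Indian languages.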

Abstract

End-to-end (E2E) systems synthesise high-quality speech, but this typically requires a large amount of data. As E2E synthesis progressed from Tacotron to FastSpeech2, it became evident that features representing prosody, particularly sub-word durations, are important for error-free synthesis. Variants of FastSpeech use a teacher model or forced alignments for training. This paper uses signal processing cues in tandem with forced alignment to produce accurate phone boundaries for the training data. As a result of better duration modelling, good-quality synthesisers are developed. Evaluations indicate that systems developed using the proposed signal processing-aided approach are better than systems developed using other alignment approaches, especially in low-resource scenarios. Our systems also outperform the existing best TTS systems available for 13 Indian languages.
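
The link between phone boundaries and duration modelling can be illustrated with a minimal sketch. It is not the authors' alignment method: it only shows the generic step in which phone boundaries (however they were obtained) are converted into per-phone frame counts, the kind of duration target a FastSpeech2-style model is trained on. The function name, the sample rate, and the hop length are assumptions chosen for illustration.

```python
def boundaries_to_durations(boundaries_s, sample_rate=22050, hop_length=256):
    """Convert phone boundary times (seconds) into per-phone frame counts.

    boundaries_s: sorted boundary times; n phones need n + 1 boundaries.
    sample_rate, hop_length: assumed analysis settings (hypothetical defaults).
    """
    if len(boundaries_s) < 2:
        raise ValueError("need at least two boundaries (one phone)")

    # Round each boundary to a frame index first, then take differences.
    # This keeps the durations summing exactly to the last frame index,
    # which avoids drift from rounding each phone length independently.
    frames = [int(round(t * sample_rate / hop_length)) for t in boundaries_s]

    durations = [end - start for start, end in zip(frames, frames[1:])]
    if any(d < 0 for d in durations):
        raise ValueError("boundaries must be non-decreasing")
    return durations


# Three phones with boundaries at 0.0, 0.1, 0.25 and 0.5 seconds,
# analysed at 16 kHz with a 10 ms hop (160 samples).
example = boundaries_to_durations([0.0, 0.1, 0.25, 0.5],
                                  sample_rate=16000, hop_length=160)
print(example)
```

Because the targets are derived directly from the boundaries, any boundary error shifts frames from one phone to its neighbour, which is why more accurate alignments translate into better duration targets.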

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems

Methods: Sigmoid Activation · Max Pooling · Highway Layer · Highway Network · Residual Connection · Tanh Activation · Convolution · Batch Normalization · Bidirectional GRU