Efficient training strategies for natural sounding speech synthesis and speaker adaptation based on FastPitch
Teodora Răgman, Adriana Stan

TL;DR
This paper adapts FastPitch for Romanian speech synthesis, expanding speaker diversity and enabling synthesis of both known and unseen speakers, with a focus on naturalness and anonymization.
Contribution
It introduces a new FastPitch-based configuration for Romanian speech synthesis that handles multiple speakers, including anonymous and unseen identities, with improved naturalness.
Findings
Effective adaptation of FastPitch to the Romanian language
Capability to synthesize speech for 18 speakers and unseen identities
Potential for anonymized speech synthesis
Abstract
This paper focuses on adapting the functionalities of the FastPitch model to the Romanian language; extending the set of speakers from one to eighteen; synthesising speech using an anonymous identity; and replicating the identities of new, unseen speakers. During this work, the effects of various configurations and training strategies were tested and discussed, along with their advantages and weaknesses. Finally, we settled on a new configuration, built on top of the FastPitch architecture, capable of producing natural-sounding synthetic speech for both known (identities from the training dataset) and unknown (identities learnt through short reference samples) speakers. The anonymous speaker can be used for text-to-speech synthesis if one wants to cancel out the identity information while keeping the semantic content whole and clear. Lastly, we discussed possible limitations of our work,…
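The abstract does not say how the anonymous identity is built or how speaker information enters the model. As a purely illustrative sketch, the Python below assumes one common multi-speaker setup: each known speaker has an embedding vector, that vector is added to every encoder frame, and a hypothetical "anonymous" speaker is taken as the centroid of the known embeddings. The names `speaker_table`, `mean_embedding`, and `condition`, along with all numeric values, are invented for this example and are not taken from the paper.

```python
from statistics import fmean


def mean_embedding(table):
    """Average all known speaker vectors, dimension by dimension."""
    vectors = list(table.values())
    return [fmean(dim) for dim in zip(*vectors)]


def condition(frames, speaker_vec):
    """Add the same speaker vector to every encoder frame."""
    return [[f + s for f, s in zip(frame, speaker_vec)] for frame in frames]


# Toy 3-dimensional embeddings for two known speakers (made-up values).
speaker_table = {
    "spk_a": [1.0, 0.0, 2.0],
    "spk_b": [3.0, 2.0, 0.0],
}

# Assumption: the "anonymous" identity is the centroid of the known speakers.
anonymous = mean_embedding(speaker_table)

# Two toy encoder frames, conditioned on the anonymous identity.
frames = [[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]]
conditioned = condition(frames, anonymous)
```

Under the same assumption, an unseen speaker would be handled by replacing `anonymous` with a vector estimated from a short reference sample. How that vector is obtained is not described in the text above.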
Taxonomy
Topics: Speech Recognition and Synthesis
Methods: Attention Is All You Need · Sparse Evolutionary Training · Dense Connections · Residual Connection · Linear Layer · Convolution · Softmax · Multi-Head Attention · Layer Normalization · FastPitch
