Speaker verification-derived loss and data augmentation for DNN-based multispeaker speech synthesis
Beata Lorincz, Adriana Stan, Mircea Giurgiu

TL;DR
This paper introduces a loss term derived from speaker verification and waveform-level data augmentation to improve multispeaker speech synthesis when little data is available per speaker. The loss term improves speaker similarity, and the augmentation improves intelligibility.
Contribution
It proposes an additional speaker verification-derived loss term, which pushes the network toward a better speaker identity representation, and data augmentation based on waveform manipulation, both aimed at low-data multispeaker TTS systems.
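The idea of appending a speaker verification-derived term to the training loss can be sketched as follows. This summary does not state the exact form of the loss, so the sketch assumes a common choice: the cosine distance between a speaker embedding of the synthesized speech and one of the reference speech, added to the usual reconstruction loss with a weight. The function names, the `weight` parameter, and the plain-list embeddings are all hypothetical; a real system would compute embeddings with a trained speaker verification network inside a deep learning framework.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as carrying no speaker information
    return dot / (norm_a * norm_b)


def speaker_loss(synth_emb, ref_emb):
    """Assumed loss form: cosine distance between speaker embeddings.

    0.0 means the embeddings point the same way; 2.0 means opposite.
    """
    return 1.0 - cosine_similarity(synth_emb, ref_emb)


def total_loss(reconstruction_loss, synth_emb, ref_emb, weight=1.0):
    """Reconstruction loss plus the weighted speaker term (hypothetical weighting)."""
    return reconstruction_loss + weight * speaker_loss(synth_emb, ref_emb)
```

Under these assumptions, `weight` trades acoustic reconstruction against speaker similarity, and it would need tuning for any given model.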
Findings
The additional loss improves speaker similarity.
Data augmentation enhances speech intelligibility.
Both methods are effective under both objective and subjective evaluation.
Abstract
Building multispeaker neural network-based text-to-speech synthesis systems commonly relies on the availability of large amounts of high-quality recordings from each speaker and conditioning the training process on the speaker's identity or on a learned representation of it. However, when little data is available from each speaker, or the number of speakers is limited, the multispeaker TTS can be hard to train and will result in poor speaker similarity and naturalness. In order to address this issue, we explore two directions: forcing the network to learn a better speaker identity representation by appending an additional loss term; and augmenting the input data pertaining to each speaker using waveform manipulation methods. We show that both methods are efficient when evaluated with both objective and subjective measures. The additional loss term aids the speaker similarity, while…
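The abstract mentions augmenting each speaker's data through waveform manipulation but, as shown here, does not say which manipulations are used. As an illustration only, the sketch below assumes speed perturbation by linear-interpolation resampling, which changes tempo and pitch together. The function names and the default factors are hypothetical and are not taken from the paper.

```python
def speed_perturb(samples, factor):
    """Resample a waveform by linear interpolation.

    factor > 1.0 shortens the signal (faster, higher pitch);
    factor < 1.0 lengthens it (slower, lower pitch).
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    n_out = max(1, int(round(len(samples) / factor)))
    out = []
    for i in range(n_out):
        pos = i * factor              # fractional read position in the input
        lo = int(pos)
        if lo >= len(samples) - 1:
            out.append(samples[-1])   # hold the last sample at the boundary
        else:
            frac = pos - lo
            out.append(samples[lo] * (1.0 - frac) + samples[lo + 1] * frac)
    return out


def augment(samples, factors=(0.9, 1.0, 1.1)):
    """Return one perturbed copy of the waveform per factor (assumed factors)."""
    return [speed_perturb(samples, f) for f in factors]
```

Each copy would be paired with the original transcript and speaker label, which multiplies the per-speaker training data without new recordings.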
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
