Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis
Erica Cooper, Xin Wang, Yi Zhao, Yusuke Yasuda, Junichi Yamagishi

TL;DR
This paper investigates various pretraining strategies, neural vocoders, and acoustic configurations to enhance zero-shot multi-speaker end-to-end speech synthesis, demonstrating improvements in naturalness, speaker similarity, and efficiency.
Contribution
It introduces effective pretraining and fine-tuning methods, compares neural vocoders, and evaluates acoustic settings for improved multi-speaker speech synthesis.
Findings
Fine-tuning a multi-speaker model from found audiobook data that has passed a simple quality threshold can improve naturalness and similarity to unseen target speakers.
Listeners can distinguish between 16kHz and 24kHz sampling rates.
WaveRNN offers comparable quality to WaveNet with faster inference.
Abstract
We explore pretraining strategies including choice of base corpus with the aim of choosing the best strategy for zero-shot multi-speaker end-to-end synthesis. We also examine choice of neural vocoder for waveform synthesis, as well as acoustic configurations used for mel spectrograms and final audio output. We find that fine-tuning a multi-speaker model from found audiobook data that has passed a simple quality threshold can improve naturalness and similarity to unseen target speakers of synthetic speech. Additionally, we find that listeners can discern between a 16kHz and 24kHz sampling rate, and that WaveRNN produces output waveforms of a comparable quality to WaveNet, with a faster inference time.
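As a rough illustration of why the 16kHz versus 24kHz difference can be audible, the sketch below computes the Nyquist limit (half the sampling rate) for each setting. This calculation is not taken from the paper; the function name `nyquist_hz` is a hypothetical helper introduced here, and the only inputs assumed are the two sampling rates named in the abstract.

```python
def nyquist_hz(sample_rate_hz: int) -> float:
    """Return the highest frequency representable at the given sampling rate."""
    return sample_rate_hz / 2.0


# The two sampling rates compared in the abstract.
rates_hz = (16_000, 24_000)

for rate in rates_hz:
    print(f"{rate} Hz sampling -> content up to {nyquist_hz(rate):.0f} Hz")

# Extra bandwidth available at 24 kHz relative to 16 kHz.
extra_band_hz = nyquist_hz(24_000) - nyquist_hz(16_000)
print(f"Additional bandwidth at 24 kHz: {extra_band_hz:.0f} Hz")
```

Under this simple view, 24kHz audio can carry content between 8 kHz and 12 kHz that 16kHz audio cannot represent, which is one plausible reason listeners can tell the two apart.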
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Tanh Activation · Sigmoid Activation · Softmax · WaveRNN · Dilated Causal Convolution · Mixture of Logistic Distributions · WaveNet
