Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis

Erica Cooper; Xin Wang; Yi Zhao; Yusuke Yasuda; Junichi Yamagishi

arXiv:2011.04839 · cs.SD · November 11, 2020 · 1 citation

TL;DR

This paper investigates pretraining strategies, neural vocoders, and acoustic configurations for zero-shot multi-speaker end-to-end speech synthesis. It reports gains in naturalness and speaker similarity from fine-tuning, and faster inference with WaveRNN at a quality comparable to WaveNet.

Contribution

It compares pretraining and fine-tuning strategies, neural vocoders, and acoustic settings for zero-shot multi-speaker speech synthesis.

Findings

01

Fine-tuning a multi-speaker model from found audiobook data that has passed a simple quality threshold can improve naturalness and similarity to unseen target speakers.

02

Listeners can distinguish between 16kHz and 24kHz sampling rates.

03

WaveRNN offers comparable quality to WaveNet with faster inference.

Abstract

We explore pretraining strategies including choice of base corpus with the aim of choosing the best strategy for zero-shot multi-speaker end-to-end synthesis. We also examine choice of neural vocoder for waveform synthesis, as well as acoustic configurations used for mel spectrograms and final audio output. We find that fine-tuning a multi-speaker model from found audiobook data that has passed a simple quality threshold can improve naturalness and similarity to unseen target speakers of synthetic speech. Additionally, we find that listeners can discern between a 16kHz and 24kHz sampling rate, and that WaveRNN produces output waveforms of a comparable quality to WaveNet, with a faster inference time.

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Tanh Activation · Sigmoid Activation · Softmax · WaveRNN · Dilated Causal Convolution · Mixture of Logistic Distributions · WaveNet