Wasserstein GAN and Waveform Loss-based Acoustic Model Training for Multi-speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder

Yi Zhao; Shinji Takaki; Hieu-Thi Luong; Junichi Yamagishi; Daisuke Saito; Nobuaki Minematsu

arXiv:1807.11679·eess.AS·August 1, 2018·1 cites

Open Access

TL;DR

This paper introduces a novel training framework for multi-speaker TTS acoustic models using Wasserstein GAN with gradient penalty and waveform loss, significantly improving speech quality and speaker similarity.

Contribution

It proposes integrating WGAN-GP and waveform loss into acoustic model training for multi-speaker TTS, enhancing naturalness and speaker consistency.
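
For orientation, the standard WGAN-GP objective that the contribution refers to can be written as below. This is a sketch of the generic formulation, not the paper's exact loss: the symbols are assumptions chosen for illustration, with $x$ standing for natural acoustic features, $\tilde{x}$ for features predicted by the acoustic model, $D$ for the critic, and $\lambda$ for the penalty weight. How the adversarial term is weighted against the waveform loss is not stated in this summary, so it is left out.

```latex
% Generic WGAN-GP losses; notation is assumed, not taken from the paper.
\begin{align}
L_D &= \mathbb{E}_{\tilde{x} \sim P_g}\big[D(\tilde{x})\big]
     - \mathbb{E}_{x \sim P_r}\big[D(x)\big]
     + \lambda\, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}
       \Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big], \\
\hat{x} &= \epsilon\, x + (1 - \epsilon)\, \tilde{x},
       \qquad \epsilon \sim U[0, 1], \\
L_G &= -\,\mathbb{E}_{\tilde{x} \sim P_g}\big[D(\tilde{x})\big].
\end{align}
```

The penalty term pushes the critic's gradient norm toward 1 at points interpolated between natural and predicted features, which is how WGAN-GP enforces the Lipschitz constraint without weight clipping.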

Findings

01 WGAN-GP combined with the discretized mixture of logistics (DML) loss yields the best subjective quality.

02 The proposed method reduces the mismatch between natural and predicted acoustic features.

03 The proposed method improves speaker similarity in multi-speaker TTS systems.

Abstract

Recent neural networks such as WaveNet and SampleRNN that learn directly from speech waveform samples have achieved very high-quality synthetic speech in terms of both naturalness and speaker similarity even in multi-speaker text-to-speech synthesis systems. Such neural networks are being used as an alternative to vocoders and hence they are often called neural vocoders. The neural vocoder uses acoustic features as local condition parameters, and these parameters need to be accurately predicted by another acoustic model. However, it is not yet clear how to train this acoustic model, which is problematic because the final quality of synthetic speech is significantly affected by the performance of the acoustic model. Significant degradation happens, especially when predicted acoustic features have mismatched characteristics compared to natural ones. In order to reduce the mismatched…

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling

Methods: Mixture of Logistic Distributions · Convolution · Dilated Causal Convolution · WaveNet