Multi-SpectroGAN: High-Diversity and High-Fidelity Spectrogram Generation with Adversarial Style Combination for Speech Synthesis

Sang-Hoon Lee; Hyun-Wook Yoon; Hyeong-Rae Noh; Ji-Hoon Kim; Seong-Whan Lee

arXiv:2012.07267 · eess.AS · December 15, 2020 · AAAI

TL;DR

Multi-SpectroGAN is a text-to-speech model trained with adversarial feedback alone, with no reconstruction loss. It learns style embeddings and combines them adversarially to generate high-diversity, high-fidelity spectrograms.

Contribution

The paper proposes Multi-SpectroGAN, a new GAN-based TTS model that trains solely with adversarial feedback and introduces adversarial style combination for style generalization.

Findings

01

Achieves high naturalness in spectrogram synthesis comparable to ground-truth.

02

Generates diverse speech styles by controlling style embeddings.

03

Outperforms existing models in style generalization and spectrogram quality.

Abstract

Although generative adversarial network (GAN)-based neural text-to-speech (TTS) systems have significantly improved neural speech synthesis, no TTS system learns to synthesize speech from text sequences with adversarial feedback alone. Because adversarial feedback by itself is not sufficient to train the generator, current models still require a reconstruction loss computed directly between the ground-truth and generated mel-spectrograms. In this paper, we present Multi-SpectroGAN (MSG), which can train a multi-speaker model with only adversarial feedback by conditioning a conditional discriminator on a self-supervised hidden representation of the generator. This leads to better guidance for generator training. Moreover, we also propose adversarial style combination (ASC) for better generalization to unseen speaking styles and transcripts, which can learn latent…
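The two ideas in the abstract, mixing style embeddings and training from discriminator scores alone, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not say how styles are combined or which adversarial objective is used, so the sketch uses plain linear interpolation and a least-squares GAN loss, and the function names are hypothetical. The discriminator itself, and its conditioning on the generator's hidden representation, is not modeled; the scores are stand-in numbers.

```python
from typing import List, Sequence


def combine_styles(style_a: Sequence[float], style_b: Sequence[float],
                   alpha: float) -> List[float]:
    """Mix two style embeddings by linear interpolation (an assumed scheme)."""
    if len(style_a) != len(style_b):
        raise ValueError("style embeddings must have the same dimension")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(style_a, style_b)]


def discriminator_loss(real_scores: Sequence[float],
                       fake_scores: Sequence[float]) -> float:
    """Least-squares GAN loss: push real scores to 1 and fake scores to 0."""
    real_term = sum((s - 1.0) ** 2 for s in real_scores) / len(real_scores)
    fake_term = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def generator_loss(fake_scores: Sequence[float]) -> float:
    """Adversarial-only generator loss: no reconstruction term is added."""
    return sum((s - 1.0) ** 2 for s in fake_scores) / len(fake_scores)


# Hypothetical 2-dimensional style embeddings from two reference utterances.
mixed = combine_styles([1.0, 0.0], [0.0, 1.0], alpha=0.5)
print(mixed)
```

In a full system, a spectrogram generated from the mixed embedding has no ground-truth target, so a reconstruction loss could not be computed for it; only a discriminator score is available. That is one reason adversarial-only training and style combination fit together.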
