TL;DR
Multi-SpectroGAN is a GAN-based text-to-speech model trained with adversarial feedback only, without a reconstruction loss. It learns style embeddings and combines them to synthesize diverse, high-fidelity mel-spectrograms.
Contribution
The paper proposes Multi-SpectroGAN (MSG), a multi-speaker GAN-based TTS model trained solely with adversarial feedback, and adversarial style combination (ASC), which targets generalization to unseen speaking styles and transcripts.
Findings
Achieves naturalness in synthesized spectrograms comparable to the ground truth.
Generates diverse speech styles by controlling style embeddings.
Outperforms existing models in style generalization and spectrogram quality.
Abstract
While generative adversarial network (GAN)-based neural text-to-speech (TTS) systems have significantly improved neural speech synthesis, no TTS system yet learns to synthesize speech from text sequences with adversarial feedback alone. Because adversarial feedback alone is not sufficient to train the generator, current models still require a reconstruction loss computed directly between the ground-truth and generated mel-spectrograms. In this paper, we present Multi-SpectroGAN (MSG), which can train a multi-speaker model with only adversarial feedback by conditioning a conditional discriminator on a self-supervised hidden representation of the generator. This leads to better guidance for generator training. Moreover, we also propose adversarial style combination (ASC) for better generalization to unseen speaking styles and transcripts, which can learn latent…
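The idea of combining styles can be illustrated with a minimal Python sketch. It assumes style embeddings are fixed-length vectors and that styles are mixed by a normalized weighted average. The abstract is truncated before it specifies how ASC forms the combined style, so the function name `combine_styles`, the weights, and the averaging rule are illustrative assumptions, not the paper's method.

```python
from typing import List, Sequence


def combine_styles(
    embeddings: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Mix style embeddings with a normalized weighted average.

    Hypothetical helper: the averaging rule is an assumption made for
    illustration, not the combination rule used by Multi-SpectroGAN.
    """
    if not embeddings:
        raise ValueError("at least one style embedding is required")
    if len(embeddings) != len(weights):
        raise ValueError("need exactly one weight per embedding")

    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")

    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    # Normalize so the weights sum to 1 (a convex combination
    # when every weight is non-negative).
    norm = [w / total for w in weights]

    # Weighted sum, computed one embedding dimension at a time.
    return [
        sum(w * e[i] for w, e in zip(norm, embeddings))
        for i in range(dim)
    ]


# Toy 3-dimensional "style embeddings" for two speakers (made-up values).
style_a = [1.0, 0.0, 2.0]
style_b = [0.0, 1.0, 0.0]

# Equal mix of the two styles.
mixed = combine_styles([style_a, style_b], [0.5, 0.5])
print(mixed)
```

In a real system the mixed vector would condition the generator in place of a single speaker's embedding. Changing the weights moves the result between the two source styles.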