TFGAN: Time and Frequency Domain Based Generative Adversarial Network for High-fidelity Speech Synthesis

Qiao Tian; Yi Chen; Zewang Zhang; Heng Lu; Linghui Chen; Lei Xie; Shan Liu

arXiv:2011.12206·eess.AS·November 25, 2020·22 cites

TL;DR

TFGAN introduces a novel adversarial training approach in both time and frequency domains for speech synthesis, significantly improving speech quality while maintaining fast synthesis speed comparable to MelGAN.

Contribution

The paper proposes TFGAN, a vocoder model that discriminates in both the time and frequency domains and trains the generator with a set of time-domain losses, improving speech quality over existing GAN-based methods.

Findings

01. TFGAN achieves higher speech fidelity than MelGAN.
02. TFGAN attains MOS comparable to autoregressive vocoders.
03. The method maintains real-time synthesis speed.

Abstract

Recently, GAN-based speech synthesis methods, such as MelGAN, have become very popular. Compared to conventional autoregressive methods, generators with parallel structures make the waveform generation process fast and stable. However, the quality of speech generated by autoregressive neural vocoders, such as WaveRNN, is still higher than that of GAN-based vocoders. To address this issue, we propose a novel vocoder model, TFGAN, which is adversarially learned in both the time and frequency domains. On one hand, we propose to discriminate the ground-truth waveform from the synthetic one in the frequency domain, rather than only in the time domain, to offer more consistency guarantees. On the other hand, in contrast to the conventional frequency-domain STFT loss or the discriminator feature map loss for learning the waveform, we propose a set of time-domain losses that encourage the generator to capture the waveform…
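The two-domain idea in the abstract can be illustrated with a small sketch: the same waveform is presented once as raw samples (the time-domain view) and once as a short-time Fourier transform magnitude (the frequency-domain view), so that one discriminator could be attached to each. This is only an illustration under stated assumptions. The helper names `stft_magnitude` and `discriminator_views`, the Hann window, and the frame and hop sizes are all hypothetical; this page does not give the paper's discriminator architectures, its STFT settings, or the form of its time-domain losses.

```python
import cmath
import math


def hann(n):
    # Periodic Hann window of length n.
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def stft_magnitude(wave, frame_len=16, hop=8):
    """Naive STFT magnitude (assumed settings, not the paper's).

    Returns a list of frames, each holding frame_len // 2 + 1 magnitude bins.
    """
    window = hann(frame_len)
    frames = []
    for start in range(0, len(wave) - frame_len + 1, hop):
        seg = [wave[start + i] * window[i] for i in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):
            # Direct DFT of the windowed segment at frequency bin k.
            acc = sum(
                seg[i] * cmath.exp(-2j * math.pi * k * i / frame_len)
                for i in range(frame_len)
            )
            bins.append(abs(acc))
        frames.append(bins)
    return frames


def discriminator_views(wave):
    # Hypothetical helper: one input per discriminator domain.
    return {"time": list(wave), "frequency": stft_magnitude(wave)}


# Toy stand-in for a ground-truth waveform: a sinusoid centred on bin 2
# of a 16-sample frame.
real = [math.sin(2.0 * math.pi * 2 * n / 16) for n in range(64)]
views = discriminator_views(real)

# Index of the strongest frequency bin in the first frame.
peak_bin = max(range(len(views["frequency"][0])),
               key=lambda k: views["frequency"][0][k])
```

In a real vocoder the two views would be fed to learned discriminators and computed with an optimized STFT routine; the pure-Python DFT here is chosen only to keep the sketch dependency-free.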

Code & Models

Repositories

rishikksh20/tfgan
pytorch

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: 1x1 Convolution · Grouped Convolution · Sigmoid Activation · Tanh Activation · Residual Connection · Dilated Convolution · Average Pooling · Window-based Discriminator