TFGAN: Time and Frequency Domain Based Generative Adversarial Network for High-fidelity Speech Synthesis
Qiao Tian, Yi Chen, Zewang Zhang, Heng Lu, Linghui Chen, Lei Xie, Shan Liu

TL;DR
TFGAN introduces a novel adversarial training approach in both time and frequency domains for speech synthesis, significantly improving speech quality while maintaining fast synthesis speed comparable to MelGAN.
Contribution
The paper proposes TFGAN, a new vocoder model that discriminates in both time and frequency domains and uses time-domain loss, enhancing speech quality over existing GAN-based methods.
Findings
TFGAN achieves higher speech fidelity than MelGAN.
TFGAN attains MOS comparable to autoregressive vocoders.
The method maintains real-time synthesis speed.
Abstract
Recently, GAN-based speech synthesis methods, such as MelGAN, have become very popular. Compared to conventional autoregressive methods, generators based on parallel structures make the waveform generation process fast and stable. However, the quality of speech generated by autoregressive neural vocoders, such as WaveRNN, is still higher than that of GAN-based vocoders. To address this issue, we propose a novel vocoder model, TFGAN, which is adversarially learned in both the time and frequency domains. On one hand, we propose to discriminate the ground-truth waveform from the synthetic one in the frequency domain, rather than only in the time domain, to offer more consistency guarantees. On the other hand, in contrast to the conventional frequency-domain STFT loss approach, or the discriminator's feature map loss, for learning the waveform, we propose a set of time-domain losses that encourage the generator to capture the waveform…
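As a rough illustration of the two domains the abstract refers to, the sketch below builds a time-domain view (raw samples) and a frequency-domain view (STFT magnitudes) of a ground-truth and a synthetic waveform. It is a minimal sketch under stated assumptions, not the paper's architecture: the helper name `stft_magnitude`, the FFT size, the hop length, the toy sine-wave signals, and the choice of magnitudes as the frequency-domain discriminator input are all hypothetical.

```python
import numpy as np

def stft_magnitude(wave, n_fft=1024, hop=256):
    """Magnitude spectrogram from a Hann-windowed framed FFT (hypothetical helper)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # One row per frame, one column per non-negative frequency bin.
    return np.abs(np.fft.rfft(frames, axis=1))

# Toy stand-ins for a ground-truth and a synthetic waveform (1 s at 16 kHz).
sr = 16000
t = np.arange(sr) / sr
real = np.sin(2 * np.pi * 440 * t)
fake = real + 0.01 * np.random.default_rng(0).standard_normal(sr)

# Time-domain view: the raw samples themselves.
time_view_real, time_view_fake = real, fake

# Frequency-domain view (assumed here to be STFT magnitudes).
freq_view_real = stft_magnitude(real)
freq_view_fake = stft_magnitude(fake)
```

In an adversarial setup of this kind, one discriminator would score the time-domain views and another the frequency-domain views; the discriminator networks themselves are omitted here because the abstract does not describe them.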
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: 1x1 Convolution · Grouped Convolution · Sigmoid Activation · Tanh Activation · Residual Connection · Dilated Convolution · Average Pooling · Window-based Discriminator
