A Multi-Scale Time-Frequency Spectrogram Discriminator for GAN-based Non-Autoregressive TTS
Haohan Guo, Hui Lu, Xixin Wu, Helen Meng

TL;DR
This paper introduces a multi-scale time-frequency spectrogram discriminator for GAN-based non-autoregressive TTS, improving the quality and fidelity of generated speech by capturing detailed spectrogram features at multiple scales.
Contribution
The paper proposes a multi-scale spectrogram discriminator with a U-Net structure for GAN-based NAR-TTS systems. It treats the Mel-spectrogram as a 2D image and discriminates at several scales, capturing both coarse and fine details in speech spectrograms.
Findings
Subjective tests show a significant improvement in speech naturalness and fidelity.
Both multi-scale and time-frequency discrimination contribute to the improvement.
Visualizations support the effectiveness of multi-scale discrimination.
Abstract
The generative adversarial network (GAN) has shown outstanding capability in improving Non-Autoregressive TTS (NAR-TTS) by adversarially training it with an extra model that discriminates between real and generated speech. To maximize the benefits of GAN, it is crucial to find a powerful discriminator that can capture rich distinguishable information. In this paper, we propose a multi-scale time-frequency spectrogram discriminator to help NAR-TTS generate high-fidelity Mel-spectrograms. It treats the spectrogram as a 2D image to exploit the correlation among different components in the time-frequency domain. A U-Net-based model structure is employed to discriminate at different scales, capturing both coarse-grained and fine-grained information. We conduct subjective tests to evaluate the proposed approach. Both multi-scale and time-frequency discriminating bring…
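The idea of looking at a spectrogram as a 2D image at several scales can be illustrated with a toy example. The sketch below is not the paper's U-Net discriminator and involves no learning: it assumes a spectrogram stored as a list of rows (time by frequency), builds coarser views by 2x2 average pooling, and measures the real-versus-generated gap at each scale. The function names (`avg_pool_2x2`, `build_pyramid`, `multi_scale_gap`) and the choice of average pooling are hypothetical and chosen only for illustration.

```python
# Toy illustration only: a hand-written multi-resolution comparison.
# It stands in for the *idea* of multi-scale discrimination and is NOT
# the learned U-Net discriminator described in the abstract.

def avg_pool_2x2(spec):
    """Halve the time and frequency axes by averaging each 2x2 patch."""
    rows, cols = len(spec) // 2, len(spec[0]) // 2
    return [
        [
            (spec[2 * r][2 * c] + spec[2 * r][2 * c + 1]
             + spec[2 * r + 1][2 * c] + spec[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def build_pyramid(spec, num_scales):
    """Return views of one spectrogram ordered from finest to coarsest."""
    pyramid = [spec]
    for _ in range(num_scales - 1):
        pyramid.append(avg_pool_2x2(pyramid[-1]))
    return pyramid


def multi_scale_gap(real, fake, num_scales=3):
    """Mean absolute real-vs-generated difference at each scale."""
    gaps = []
    real_views = build_pyramid(real, num_scales)
    fake_views = build_pyramid(fake, num_scales)
    for r_view, f_view in zip(real_views, fake_views):
        total = sum(
            abs(a - b)
            for r_row, f_row in zip(r_view, f_view)
            for a, b in zip(r_row, f_row)
        )
        gaps.append(total / (len(r_view) * len(r_view[0])))
    return gaps


# Hypothetical 8x8 "spectrograms": a flat reference, a fine-grained
# checkerboard artifact, and a coarse constant offset.
real = [[0.0] * 8 for _ in range(8)]
checker = [[1.0 if (t + f) % 2 == 0 else -1.0 for f in range(8)] for t in range(8)]
offset = [[0.5] * 8 for _ in range(8)]

fine_artifact_gaps = multi_scale_gap(real, checker)
coarse_artifact_gaps = multi_scale_gap(real, offset)
```

In this toy setup the checkerboard artifact is visible only at the finest scale and averages out after pooling, while the constant offset shows up equally at every scale. A single coarse view would therefore miss the fine artifact, which gives one intuition for why a discriminator that judges several scales can capture more distinguishable information.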
Taxonomy
Topics: Speech and Audio Processing · Machine Fault Diagnosis Techniques · Digital Media Forensic Detection
