An Investigation of Time-Frequency Representation Discriminators for High-Fidelity Vocoder
Yicheng Gu, Xueyao Zhang, Liumeng Xue, Haizhou Li, Zhizheng Wu

TL;DR
This paper introduces multi-scale discriminators for GAN vocoders based on the Constant-Q Transform (CQT) and the Continuous Wavelet Transform (CWT). They improve synthesis quality by better capturing pitch and transient features, for both speech and singing voices.
Contribution
The paper proposes two new multi-scale discriminators, one using CQT and one using CWT, that improve GAN vocoder performance by modeling dynamic time-frequency features.
Findings
CQT discriminator excels in pitch modeling.
CWT discriminator captures short-time transients.
Combined discriminators improve synthesis quality across models.
Abstract
Generative Adversarial Network (GAN)-based vocoders are superior in both inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator for GAN-based vocoders. Most existing Time-Frequency Representation (TFR)-based discriminators are rooted in the Short-Time Fourier Transform (STFT), which has a constant Time-Frequency (TF) resolution, linearly scaled center frequencies, and a fixed decomposition basis. This makes it a poor fit for signals such as singing voices, which require dynamic attention across different frequency bands and time intervals. Motivated by this, we propose a Multi-Scale Sub-Band Constant-Q Transform (MS-SB-CQT) discriminator and a Multi-Scale Temporal-Compressed Continuous Wavelet Transform (MS-TC-CWT) discriminator. Both CQT and CWT have a dynamic TF resolution…
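The contrast the abstract draws between STFT and CQT can be made concrete with a short numerical sketch. It compares the linearly spaced STFT bin centers with the geometrically spaced CQT bin centers, and shows how the CQT analysis window shrinks as frequency rises. The sample rate, FFT size, minimum frequency, and bin counts are illustrative assumptions, not settings taken from the paper, and the function names are hypothetical helpers rather than part of any library.

```python
import numpy as np


def stft_center_frequencies(sample_rate, n_fft):
    """Center frequencies of STFT bins: linearly spaced, sample_rate / n_fft apart."""
    return np.arange(n_fft // 2 + 1) * sample_rate / n_fft


def cqt_center_frequencies(f_min, n_bins, bins_per_octave):
    """Center frequencies of CQT bins: geometrically spaced from f_min."""
    return f_min * 2.0 ** (np.arange(n_bins) / bins_per_octave)


def cqt_window_lengths(sample_rate, freqs, bins_per_octave):
    """Per-bin window length in samples, using the standard constant-Q factor."""
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    return np.ceil(q * sample_rate / freqs).astype(int)


# Illustrative values only (assumptions, not the paper's configuration).
sample_rate = 24000
n_fft = 1024
f_min = 32.70
n_bins = 84
bins_per_octave = 12

stft_freqs = stft_center_frequencies(sample_rate, n_fft)
cqt_freqs = cqt_center_frequencies(f_min, n_bins, bins_per_octave)
cqt_windows = cqt_window_lengths(sample_rate, cqt_freqs, bins_per_octave)

# STFT: one fixed window length and a constant spacing in Hz for every bin.
print("STFT bin spacing (Hz):", stft_freqs[1] - stft_freqs[0])
# CQT: a constant frequency ratio between bins, so spacing in Hz grows with frequency.
print("CQT bin ratio:", cqt_freqs[1] / cqt_freqs[0])
# CQT: long windows at low frequencies, short windows at high frequencies.
print("CQT window at lowest bin (samples):", cqt_windows[0])
print("CQT window at highest bin (samples):", cqt_windows[-1])
```

The longer low-frequency windows give finer frequency resolution where pitch harmonics sit, while the shorter high-frequency windows give finer time resolution. This is the sense in which CQT has a dynamic TF resolution and STFT does not.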
Taxonomy
Topics: Sensor Technology and Measurement Systems
