SingGAN: Generative Adversarial Network For High-Fidelity Singing Voice Generation
Rongjie Huang, Chenye Cui, Feiyang Chen, Yi Ren, Jinglin Liu, Zhou Zhao, Baoxing Huai, Zhefeng Wang

TL;DR
SingGAN is a novel GAN-based model that achieves high-fidelity, real-time singing voice synthesis with improved high-frequency detail and stability, surpassing previous neural vocoders in quality and efficiency.
Contribution
SingGAN, described as the first GAN architecture of its kind, combines source excitation and adaptive feature learning filters with multi-scale global and local discriminators for high-quality singing voice synthesis.
Findings
Achieves state-of-the-art MOS of 4.05 in singing voice quality.
Operates at 50x real-time speed on a single GPU.
Generalizes well to unseen singers in mel-spectrogram inversion.
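As a reading aid for the speed finding, the snippet below sketches how an "N x real-time" figure is usually computed: seconds of audio produced divided by wall-clock seconds spent synthesizing. The function name and the timing values are hypothetical and are not measurements from the paper.

```python
def realtime_speedup(audio_seconds: float, synthesis_seconds: float) -> float:
    """Return how many times faster than real time the synthesis ran."""
    if synthesis_seconds <= 0:
        raise ValueError("synthesis_seconds must be positive")
    return audio_seconds / synthesis_seconds


# Hypothetical timing: 10 s of audio generated in 0.2 s of wall-clock time.
speedup = realtime_speedup(10.0, 0.2)
print(f"{speedup:.0f}x real time")
```

Under this definition, a 50x figure means that one second of singing is generated in roughly 20 ms.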
Abstract
Deep generative models have achieved significant progress in speech synthesis to date, while high-fidelity singing voice synthesis is still an open problem for its long continuous pronunciation, rich high-frequency parts, and strong expressiveness. Existing neural vocoders designed for text-to-speech cannot directly be applied to singing voice synthesis because they result in glitches and poor high-frequency reconstruction. In this work, we propose SingGAN, a generative adversarial network designed for high-fidelity singing voice synthesis. Specifically, 1) to alleviate the glitch problem in the generated samples, we propose source excitation with the adaptive feature learning filters to expand the receptive field patterns and stabilize long continuous signal generation; and 2) SingGAN introduces global and local discriminators at different scales to enrich low-frequency details and…
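The abstract does not say how the source excitation is constructed. A common approach in source-filter style vocoders, assumed here purely for illustration, is to turn a frame-level F0 contour into a sample-level sine wave by accumulating phase. The function name, the hop size, the sample rate, and the choice of silence for unvoiced frames are all hypothetical and are not taken from the paper.

```python
import math


def sine_excitation(f0_frames, hop_size=256, sample_rate=24000, amplitude=0.1):
    """Build a sample-level sine excitation from frame-level F0 values in Hz.

    Each frame's F0 is held constant for hop_size samples and the phase is
    accumulated across frames so the waveform stays continuous.
    Assumption: unvoiced frames (f0 <= 0) are rendered as silence here.
    """
    samples = []
    phase = 0.0
    for f0 in f0_frames:
        for _ in range(hop_size):
            if f0 > 0:
                # Advance the phase by one sample at the current pitch.
                phase = (phase + 2.0 * math.pi * f0 / sample_rate) % (2.0 * math.pi)
                samples.append(amplitude * math.sin(phase))
            else:
                samples.append(0.0)
    return samples


# Toy contour: two voiced frames, one unvoiced frame, one voiced frame.
excitation = sine_excitation([220.0, 220.0, 0.0, 440.0], hop_size=4, sample_rate=16000)
print(len(excitation))
```

Carrying the phase across frame boundaries avoids discontinuities in the excitation, which is one plausible way to keep long sustained notes free of clicks. Whether SingGAN does exactly this is not stated in the text above.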
