SingGAN: Generative Adversarial Network For High-Fidelity Singing Voice Generation

Rongjie Huang; Chenye Cui; Feiyang Chen; Yi Ren; Jinglin Liu; Zhou Zhao; Baoxing Huai; Zhefeng Wang

arXiv:2110.07468·eess.AS·August 8, 2022

TL;DR

SingGAN is a novel GAN-based model that achieves high-fidelity, real-time singing voice synthesis with improved high-frequency detail and stability, surpassing previous neural vocoders in quality and efficiency.

Contribution

SingGAN is presented as the first GAN architecture designed specifically for high-quality singing voice synthesis, combining adaptive feature learning with multi-scale discriminators.

Findings

01. Achieves state-of-the-art MOS of 4.05 in singing voice quality.

02. Operates at 50x real-time speed on a single GPU.

03. Generalizes well to unseen singers in mel-spectrogram inversion.
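
To make the speed figure concrete: a speed of "N x real time" is conventionally the duration of the generated audio divided by the wall-clock time needed to synthesize it. The sketch below shows only that arithmetic. The function name and the example numbers are hypothetical and are not measurements from the paper.

```python
def speedup_over_real_time(audio_seconds: float, synthesis_seconds: float) -> float:
    """Return seconds of audio produced per second of wall-clock compute."""
    if synthesis_seconds <= 0:
        raise ValueError("synthesis_seconds must be positive")
    return audio_seconds / synthesis_seconds


# Hypothetical example (not from the paper): a 10 s clip
# synthesized in 0.2 s of wall-clock time.
if __name__ == "__main__":
    print(speedup_over_real_time(10.0, 0.2))
```

A value above 1 means the vocoder generates audio faster than it would take to play it back.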

Abstract

Deep generative models have achieved significant progress in speech synthesis to date, while high-fidelity singing voice synthesis is still an open problem for its long continuous pronunciation, rich high-frequency parts, and strong expressiveness. Existing neural vocoders designed for text-to-speech cannot directly be applied to singing voice synthesis because they result in glitches and poor high-frequency reconstruction. In this work, we propose SingGAN, a generative adversarial network designed for high-fidelity singing voice synthesis. Specifically, 1) to alleviate the glitch problem in the generated samples, we propose source excitation with the adaptive feature learning filters to expand the receptive field patterns and stabilize long continuous signal generation; and 2) SingGAN introduces global and local discriminators at different scales to enrich low-frequency details and…
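
The abstract does not spell out how the source excitation is built. As a rough illustration of the general idea, the sketch below generates a sine excitation from a per-sample F0 track by accumulating phase, in the style of neural source-filter vocoders. This is an assumption about the flavor of the technique, not SingGAN's actual implementation; the function name, the 24 kHz sample rate, the amplitude, and the choice to zero out unvoiced samples are all hypothetical.

```python
import math


def sine_excitation(f0_per_sample, sample_rate=24000, amplitude=0.1):
    """Build a sine source signal from a per-sample F0 track (in Hz).

    Phase is accumulated sample by sample, so the waveform stays continuous
    when the pitch changes during a long sustained note.
    Assumption: unvoiced samples (F0 == 0) are set to zero here; source-filter
    models often use noise in those regions instead.
    """
    two_pi = 2.0 * math.pi
    phase = 0.0
    signal = []
    for f0 in f0_per_sample:
        if f0 > 0:
            # Advance the phase by the instantaneous frequency of this sample.
            phase = (phase + two_pi * f0 / sample_rate) % two_pi
            signal.append(amplitude * math.sin(phase))
        else:
            signal.append(0.0)
    return signal


if __name__ == "__main__":
    # Hypothetical input: 100 voiced samples at 440 Hz, then 10 unvoiced samples.
    excitation = sine_excitation([440.0] * 100 + [0.0] * 10)
    print(len(excitation))
```

Accumulating phase, rather than evaluating sin(2πft) independently per frame, avoids phase jumps at frame boundaries, which is one plausible source of the glitches the abstract mentions for long continuous signals.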
