Mandarin Singing Voice Synthesis with Denoising Diffusion Probabilistic Wasserstein GAN
Yin-Ping Cho, Yu Tsao, Hsin-Min Wang, and Yi-Wen Liu

TL;DR
This paper introduces an end-to-end Mandarin singing voice synthesis system whose acoustic model combines a denoising diffusion probabilistic model (DDPM) with a Wasserstein GAN (WGAN), achieving higher expressiveness and stable training without a reconstruction loss.
Contribution
The paper proposes a new acoustic model architecture that integrates a DDPM with a WGAN to improve expressiveness and training stability in singing voice synthesis.
Findings
Enhanced musical expressiveness and high-frequency detail in synthesized singing.
Stable convergence of the adversarial acoustic model without reconstruction loss.
Outperforms previous methods in quality and stability on Mandarin singing data.
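To make the second finding concrete, the following is a minimal sketch of what training "without reconstruction loss" means for a Wasserstein-style objective. It is not the authors' implementation: the function names `critic_loss` and `generator_loss` and the toy scores are hypothetical, and any Lipschitz constraint the paper may use (weight clipping, gradient penalty, or similar) is omitted here.

```python
from statistics import fmean


def critic_loss(real_scores, fake_scores):
    """Wasserstein critic loss: minimizing it pushes scores for real
    samples up and scores for generated samples down."""
    return fmean(fake_scores) - fmean(real_scores)


def generator_loss(fake_scores):
    """Adversarial-only generator loss. Note that no L1/L2 reconstruction
    term against the ground-truth spectrogram is added."""
    return -fmean(fake_scores)


# Toy critic outputs (hypothetical values, for illustration only).
real = [0.9, 1.1, 1.0]
fake = [-0.2, 0.1, 0.1]

print("critic loss:", critic_loss(real, fake))
print("generator loss:", generator_loss(fake))
```

In many GAN-based acoustic models the generator loss would carry an extra reconstruction term; the finding above reports that the proposed model converges stably without one.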
Abstract
Singing voice synthesis (SVS) is the computer production of a human-like singing voice from given musical scores. To accomplish end-to-end SVS effectively and efficiently, this work adopts the acoustic model-neural vocoder architecture established for high-quality speech and singing voice synthesis. Specifically, this work aims to pursue a higher level of expressiveness in synthesized voices by combining the diffusion denoising probabilistic model (DDPM) and Wasserstein generative adversarial network (WGAN) to construct the backbone of the acoustic model. On top of the proposed acoustic model, a HiFi-GAN neural vocoder is adopted with integrated fine-tuning to ensure optimal synthesis quality for the resulting end-to-end SVS system. This end-to-end system was evaluated with the multi-singer Mpop600 Mandarin singing voice dataset. In the experiments, the proposed system exhibits…
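For readers unfamiliar with the DDPM component, here is a minimal sketch of the standard forward (noising) process that a DDPM-based acoustic model learns to reverse. It is an illustration under stated assumptions, not the paper's code: the linear schedule from 1e-4 to 0.02 follows the common DDPM default rather than anything stated in the abstract, and the names `linear_betas`, `alpha_bars`, `q_sample`, and the four-value "mel frame" are hypothetical.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (assumed; the paper's schedule is not given here)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas):
    """Cumulative products of (1 - beta_t), one value per diffusion step."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def q_sample(x0, t, abar, rng):
    """Draw x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, eps ~ N(0, 1)."""
    scale = math.sqrt(abar[t])
    sigma = math.sqrt(1.0 - abar[t])
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x0]


rng = random.Random(0)                 # fixed seed for repeatability
abar = alpha_bars(linear_betas(100))   # 100 diffusion steps (assumed)
frame = [0.5, -1.0, 0.25, 0.0]         # toy stand-in for one mel-spectrogram frame
noisy = q_sample(frame, t=99, abar=abar, rng=rng)
print(noisy)
```

In the architecture the abstract describes, a generator trained with the Wasserstein objective would take the place of the usual noise-prediction network for the reverse steps; how exactly the two are coupled is specified in the paper, not in this sketch.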
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: HiFi-GAN · Convolution · Wasserstein GAN · Diffusion
