Mandarin Singing Voice Synthesis with Denoising Diffusion Probabilistic Wasserstein GAN

Yin-Ping Cho, Yu Tsao, Hsin-Min Wang, and Yi-Wen Liu

arXiv:2209.10446 · eess.AS · September 22, 2022 · 1 citation

Open Access

TL;DR

This paper introduces an end-to-end Mandarin singing voice synthesis system whose acoustic model combines a denoising diffusion probabilistic model (DDPM) with a Wasserstein GAN (WGAN). The system targets higher expressiveness, and its adversarial training converges stably without a reconstruction loss.

Contribution

The paper proposes an acoustic model whose backbone integrates a DDPM with a WGAN, aiming at improved expressiveness and training stability in singing voice synthesis.

Findings

01 Enhanced musical expressiveness and high-frequency detail in the synthesized singing.

02 Stable convergence of the adversarial acoustic model without a reconstruction loss.

03 Outperforms previous methods in quality and stability on the multi-singer Mpop600 Mandarin singing voice dataset.

Abstract

Singing voice synthesis (SVS) is the computer production of a human-like singing voice from given musical scores. To accomplish end-to-end SVS effectively and efficiently, this work adopts the acoustic model-neural vocoder architecture established for high-quality speech and singing voice synthesis. Specifically, this work aims to pursue a higher level of expressiveness in synthesized voices by combining the denoising diffusion probabilistic model (DDPM) and Wasserstein generative adversarial network (WGAN) to construct the backbone of the acoustic model. On top of the proposed acoustic model, a HiFi-GAN neural vocoder is adopted with integrated fine-tuning to ensure optimal synthesis quality for the resulting end-to-end SVS system. This end-to-end system was evaluated with the multi-singer Mpop600 Mandarin singing voice dataset. In the experiments, the proposed system exhibits…
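
The abstract names two building blocks for the acoustic model, a DDPM and a WGAN, without showing how they are wired together. The sketch below only illustrates the two standard ingredients in isolation: the DDPM forward noising step applied to a mel-spectrogram-shaped array, and the Wasserstein critic and generator losses. It is a minimal illustration, not the paper's architecture. All function names, the linear noise schedule, the step count, and the 80-bin mel shape are assumptions made here for the example.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    # Assumed linear noise schedule (the common DDPM default, not taken from the paper).
    return np.linspace(beta_start, beta_end, num_steps)

def q_sample(x0, t, alpha_bar, rng):
    # DDPM forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I).
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

def wasserstein_critic_loss(critic_real, critic_fake):
    # The critic minimizes E[D(fake)] - E[D(real)], i.e. it maximizes the score gap.
    return np.mean(critic_fake) - np.mean(critic_real)

def wasserstein_generator_loss(critic_fake):
    # The generator minimizes -E[D(fake)], i.e. it raises the critic score of its samples.
    return -np.mean(critic_fake)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    betas = linear_beta_schedule(1000)
    alpha_bar = np.cumprod(1.0 - betas)      # cumulative signal retention per step

    mel = rng.standard_normal((80, 100))     # hypothetical mel-spectrogram: 80 bins x 100 frames
    x_early, _ = q_sample(mel, 10, alpha_bar, rng)
    x_late, _ = q_sample(mel, 999, alpha_bar, rng)

    # Early steps stay close to the clean input; late steps are almost pure noise.
    print("corr(mel, x_early):", np.corrcoef(mel.ravel(), x_early.ravel())[0, 1])
    print("corr(mel, x_late): ", np.corrcoef(mel.ravel(), x_late.ravel())[0, 1])

    # Toy critic scores standing in for a trained critic network.
    real_scores = np.array([1.0, 3.0])
    fake_scores = np.array([0.0, 1.0])
    print("critic loss:   ", wasserstein_critic_loss(real_scores, fake_scores))
    print("generator loss:", wasserstein_generator_loss(fake_scores))
```

In a real system the critic scores would come from a neural network subject to a Lipschitz constraint (for example weight clipping or a gradient penalty), and the denoiser would be trained over many diffusion steps. Both are omitted here to keep the sketch short.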

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: HiFi-GAN · Convolution · Wasserstein GAN · Diffusion