Xiaoicesing 2: A High-Fidelity Singing Voice Synthesizer Based on Generative Adversarial Network

Chunhui Wang; Chang Zeng; Xing He

arXiv:2210.14666 · eess.AS · October 31, 2022 · 1 citation

TL;DR

XiaoiceSing2 is a GAN-based singing voice synthesizer that pairs a FastSpeech-based generator, built on improved Transformer blocks, with a multi-band discriminator, so that the middle- and high-frequency details of the mel-spectrogram are modeled rather than over-smoothed.

Contribution

The paper introduces XiaoiceSing2, a novel SVS system that models full-band mel-spectrogram details using a GAN with specialized multi-band discriminators and an improved generator architecture.
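To make the multi-band idea concrete, here is a minimal sketch of how a mel-spectrogram could be cut into frequency bands before each band is passed to its own sub-discriminator. It is an illustration only, not the authors' implementation: the function name `split_mel_bands`, the three-way low/middle/high split, the non-overlapping bands, and the bin boundaries `(4, 8)` are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]  # one time frame: a sequence of mel-bin values


def split_mel_bands(
    mel: Sequence[Frame], edges: Tuple[int, int]
) -> Tuple[List[List[float]], List[List[float]], List[List[float]]]:
    """Split every frame into low / middle / high mel-bin segments.

    `edges` = (low_end, mid_end) are hypothetical bin indices; the
    boundaries used in the paper are not given on this page.
    """
    low_end, mid_end = edges
    n_bins = len(mel[0])
    if not 0 < low_end < mid_end < n_bins:
        raise ValueError("edges must satisfy 0 < low_end < mid_end < n_bins")
    low = [list(frame[:low_end]) for frame in mel]
    mid = [list(frame[low_end:mid_end]) for frame in mel]
    high = [list(frame[mid_end:]) for frame in mel]
    return low, mid, high


# Toy input: 4 frames with 12 mel bins each (values are just indices).
mel = [[float(t * 12 + b) for b in range(12)] for t in range(4)]
low, mid, high = split_mel_bands(mel, edges=(4, 8))
```

Each returned band keeps the full time axis and only a slice of the frequency axis, so a sub-discriminator that receives it sees detail in that band alone.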

Findings

01 Significant quality improvement over XiaoiceSing.

02 Effective modeling of middle- and high-frequency details.

03 Enhanced full-band mel-spectrogram synthesis.

Abstract

XiaoiceSing is a singing voice synthesis (SVS) system that aims at generating 48 kHz singing voices. However, the mel-spectrogram it generates is over-smoothed in the middle- and high-frequency regions because it has no dedicated design for modeling the details of these parts. In this paper, we propose XiaoiceSing2, which generates the details of the middle- and high-frequency parts to better construct the full-band mel-spectrogram. Specifically, to alleviate the over-smoothing problem, XiaoiceSing2 adopts a generative adversarial network (GAN) consisting of a FastSpeech-based generator and a multi-band discriminator. We improve the feed-forward Transformer (FFT) block by adding multiple residual convolutional blocks in parallel with the self-attention block to balance local and global features. The multi-band discriminator contains three sub-discriminators responsible for low-, middle-, and…
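The modified FFT block described in the abstract can be sketched roughly as follows in PyTorch. This is a hedged illustration, not the official code: the class names `ResidualConvBlock` and `ParallelFFTBlock`, the layer sizes, the choice to stack the convolutional blocks in one branch, and the choice to sum the two branches before layer normalization are all assumptions. The position-wise feed-forward sublayer of a standard FFT block is omitted for brevity.

```python
import torch
from torch import nn


class ResidualConvBlock(nn.Module):
    """Two 1-D convolutions with a skip connection (local features)."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2  # keeps the time length for odd kernels
        self.conv = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        return x + self.conv(x)


class ParallelFFTBlock(nn.Module):
    """Self-attention branch (global) in parallel with a conv branch (local)."""

    def __init__(self, d_model: int = 64, n_heads: int = 2, n_conv_blocks: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.convs = nn.Sequential(
            *[ResidualConvBlock(d_model) for _ in range(n_conv_blocks)]
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        # Conv1d expects (batch, channels, time), so transpose in and out.
        conv_out = self.convs(x.transpose(1, 2)).transpose(1, 2)
        # Assumed fusion: sum both branches with the input, then normalize.
        return self.norm(x + attn_out + conv_out)


block = ParallelFFTBlock()
x = torch.randn(2, 10, 64)  # (batch, time, d_model)
y = block(x)                # same shape as the input
```

Both branches read the same input and produce tensors of the same shape, which lets the block serve as a drop-in replacement wherever a sequence-to-sequence layer with matching dimensions is expected.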

Code & Models

Repositories

zengchang233/xiaoicesing2 (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Adam · Softmax · Position-Wise Feed-Forward Layer · Linear Layer · Label Smoothing · Dense Connections · Absolute Position Encodings · Layer Normalization