XiaoiceSing 2: A High-Fidelity Singing Voice Synthesizer Based on Generative Adversarial Network
Chunhui Wang, Chang Zeng, Xing He

TL;DR
XiaoiceSing2 improves 48 kHz singing voice synthesis by training a FastSpeech-based generator with improved feed-forward Transformer blocks against a multi-band discriminator, so that the generated mel-spectrogram keeps middle- and high-frequency detail instead of being over-smoothed.
Contribution
The paper introduces XiaoiceSing2, an SVS system that models full-band mel-spectrogram details using a GAN. Its multi-band discriminator is built from three sub-discriminators, each responsible for one frequency band, and its generator uses an improved feed-forward Transformer block.
Findings
Significant quality improvement over XiaoiceSing.
Effective modeling of middle- and high-frequency details.
Enhanced full-band mel-spectrogram synthesis.
Abstract
XiaoiceSing is a singing voice synthesis (SVS) system that aims at generating 48 kHz singing voices. However, the mel-spectrogram it generates is over-smoothed in the middle- and high-frequency areas because it has no special design for modeling the details of these parts. In this paper, we propose XiaoiceSing2, which can generate the details of middle- and high-frequency parts to better construct the full-band mel-spectrogram. Specifically, to alleviate the over-smoothing, XiaoiceSing2 adopts a generative adversarial network (GAN), which consists of a FastSpeech-based generator and a multi-band discriminator. We improve the feed-forward Transformer (FFT) block by adding multiple residual convolutional blocks in parallel with the self-attention block to balance the local and global features. The multi-band discriminator contains three sub-discriminators responsible for low-, middle-, and…
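The multi-band idea in the abstract can be illustrated with a small sketch of how a mel-spectrogram might be cut along the frequency axis so that each sub-discriminator sees only its own band. This is not the authors' implementation: the abstract is truncated before it gives the band boundaries, so the equal-width, non-overlapping split, the 80-bin mel size, and the helper names `band_edges` and `split_bands` are all assumptions made for illustration.

```python
from typing import List, Tuple

Frame = List[float]  # one time step: energy per mel bin, ordered low to high frequency


def band_edges(n_mels: int, n_bands: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) bin indices that cut n_mels into contiguous bands.

    Equal-width, non-overlapping bands are an assumption, not the paper's setting.
    """
    if n_bands <= 0 or n_mels < n_bands:
        raise ValueError("need at least one mel bin per band")
    base, extra = divmod(n_mels, n_bands)
    edges, start = [], 0
    for i in range(n_bands):
        # Spread any remainder over the lowest bands so widths differ by at most 1.
        width = base + (1 if i < extra else 0)
        edges.append((start, start + width))
        start += width
    return edges


def split_bands(mel: List[Frame], n_bands: int = 3) -> List[List[Frame]]:
    """Slice every frame along the frequency axis, one slice per sub-discriminator."""
    if not mel:
        return [[] for _ in range(n_bands)]
    edges = band_edges(len(mel[0]), n_bands)
    return [[frame[lo:hi] for frame in mel] for lo, hi in edges]


# Toy input: 4 frames x 80 mel bins (80 is a hypothetical size).
mel = [[float(t * 80 + f) for f in range(80)] for t in range(4)]
low, mid, high = split_bands(mel)
print([len(band[0]) for band in (low, mid, high)])  # prints [27, 27, 26]
```

In a GAN setup of this kind, each slice would be scored by its own sub-discriminator and the adversarial losses combined, which gives the generator a separate training signal for the bands where over-smoothing occurs.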
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Adam · Softmax · Position-Wise Feed-Forward Layer · Linear Layer · Label Smoothing · Dense Connections · Absolute Position Encodings · Layer Normalization
