WeSinger 2: Fully Parallel Singing Voice Synthesis via Multi-Singer Conditional Adversarial Training
Zewang Zhang, Yibin Zheng, Xinhui Li, Li Lu

TL;DR
This paper presents WeSinger 2, a fully parallel singing voice synthesis (SVS) system that uses adversarial training with multi-singer conditioning to produce natural singing voices efficiently. The authors compare it against their own previous auto-regressive system.
Contribution
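It introduces a parallel SVS framework with two main components: random area conditional discriminators that supervise the acoustic model and counter over-smoothed spectrogram predictions, and a neural vocoder whose input combines the spectrogram with a frame-level, linearly interpolated F0 sequence.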
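The abstract calls these discriminators "random area conditional discriminators" but this page gives no further detail. The sketch below shows one plausible reading, in which each discriminator sees a randomly sized and randomly placed rectangle of the spectrogram. The function name `random_area` and the parameters `min_bins` and `min_frames` are hypothetical, and the singer conditioning is omitted. This is an illustration under those assumptions, not the authors' implementation.

```python
import numpy as np

def random_area(spec, min_bins=20, min_frames=32, rng=None):
    """Return a random rectangular crop of a (n_bins, n_frames) spectrogram.

    Assumption: "random area" means a crop with random size and position.
    The minimum sizes are illustrative defaults, not values from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    spec = np.asarray(spec)
    n_bins, n_frames = spec.shape
    # Sample the crop size first (upper bound of rng.integers is exclusive).
    height = rng.integers(min(min_bins, n_bins), n_bins + 1)
    width = rng.integers(min(min_frames, n_frames), n_frames + 1)
    # Then sample a top-left corner that keeps the crop inside the spectrogram.
    top = rng.integers(0, n_bins - height + 1)
    left = rng.integers(0, n_frames - width + 1)
    return spec[top:top + height, left:left + width]

# Example: crop an 80-bin, 200-frame spectrogram with a fixed seed.
crop = random_area(np.zeros((80, 200)), rng=np.random.default_rng(0))
```

Feeding a discriminator local patches rather than the whole spectrogram is a common way to make it sensitive to fine texture, which matches the stated goal of avoiding over-smoothed predictions.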
Findings
Produces high-quality singing voices efficiently
Outperforms the authors' previous auto-regressive system
Supports multi-singer synthesis with diverse timbres
Abstract
This paper aims to introduce a robust singing voice synthesis (SVS) system to produce very natural and realistic singing voices efficiently by leveraging the adversarial training strategy. On one hand, we designed simple but generic random area conditional discriminators to help supervise the acoustic model, which can effectively avoid the over-smoothed spectrogram prediction and improve the expressiveness of SVS. On the other hand, we subtly combined the spectrogram with the frame-level linearly-interpolated F0 sequence as the input for the neural vocoder, which is then optimized with the help of multiple adversarial conditional discriminators in the waveform domain and multi-scale distance functions in the frequency domain. The experimental results and ablation studies concluded that, compared with our previous auto-regressive work, our new system can produce high-quality singing…
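The vocoder input described in the abstract can be sketched as follows. This is a minimal illustration that assumes the F0 contour marks unvoiced frames with 0 and that "linearly interpolated" means filling those frames from the neighbouring voiced values. The abstract does not specify either point, nor whether F0 is used on a linear or log scale. The function names `interpolate_f0` and `build_vocoder_input` are hypothetical.

```python
import numpy as np

def interpolate_f0(f0):
    """Fill unvoiced frames (f0 == 0) by linear interpolation over voiced ones.

    Frames before the first or after the last voiced frame take the nearest
    voiced value, which is how np.interp handles out-of-range points.
    """
    f0 = np.asarray(f0, dtype=np.float64)
    voiced = f0 > 0
    if not voiced.any():
        return f0.copy()  # nothing to interpolate from
    frames = np.arange(len(f0))
    return np.interp(frames, frames[voiced], f0[voiced])

def build_vocoder_input(mel, f0):
    """Stack a (n_mels, T) spectrogram and a (T,) F0 contour into (n_mels + 1, T)."""
    mel = np.asarray(mel, dtype=np.float64)
    f0_interp = interpolate_f0(f0)
    if mel.shape[1] != f0_interp.shape[0]:
        raise ValueError("spectrogram and F0 must have the same number of frames")
    return np.concatenate([mel, f0_interp[None, :]], axis=0)

# Example: six frames, voiced only at frames 1 and 4.
f0 = np.array([0.0, 220.0, 0.0, 0.0, 250.0, 0.0])
features = build_vocoder_input(np.zeros((80, 6)), f0)
# interpolate_f0(f0) gives [220, 220, 230, 240, 250, 250]; features has shape (81, 6).
```

A continuous F0 track gives the vocoder an explicit pitch signal on every frame, which is one reason such a combined input could help with the wide pitch range of singing.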
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
