WeSinger 2: Fully Parallel Singing Voice Synthesis via Multi-Singer Conditional Adversarial Training

Zewang Zhang; Yibin Zheng; Xinhui Li; Li Lu

arXiv:2207.01886 · cs.SD · February 17, 2023 · 1 citation

TL;DR

This paper presents WeSinger 2, a fully parallel singing voice synthesis (SVS) system that uses adversarial training and multi-singer conditioning to produce highly natural singing voices efficiently, outperforming the authors' previous autoregressive system.

Contribution

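It introduces a parallel SVS framework that supervises the acoustic model with generic random area conditional discriminators and feeds the neural vocoder a combined input of the spectrogram and a linearly interpolated F0 sequence, improving both expressiveness and efficiency.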
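The combined spectrogram-F0 vocoder input can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes unvoiced frames are marked with F0 = 0, fills them by linear interpolation between the neighboring voiced frames (holding the edge values at the start and end), and appends log-F0 as one extra feature per mel frame. The names `interpolate_f0` and `build_vocoder_input`, the time-major frame layout, and the log scaling are all hypothetical choices.

```python
import bisect
import math
from typing import List, Sequence


def interpolate_f0(f0: Sequence[float]) -> List[float]:
    """Fill unvoiced frames (assumed to be marked F0 == 0) by linear interpolation.

    Frames before the first or after the last voiced frame copy the nearest
    voiced value; an all-unvoiced input stays all zeros.
    """
    voiced = [i for i, hz in enumerate(f0) if hz > 0]
    if not voiced:
        return [0.0] * len(f0)
    out: List[float] = []
    for i, hz in enumerate(f0):
        if hz > 0:
            out.append(float(hz))
        elif i < voiced[0]:
            out.append(float(f0[voiced[0]]))
        elif i > voiced[-1]:
            out.append(float(f0[voiced[-1]]))
        else:
            # Nearest voiced frames on each side of the unvoiced frame i.
            k = bisect.bisect_left(voiced, i)
            left, right = voiced[k - 1], voiced[k]
            w = (i - left) / (right - left)
            out.append(f0[left] + w * (f0[right] - f0[left]))
    return out


def build_vocoder_input(mel_frames: Sequence[Sequence[float]],
                        f0: Sequence[float]) -> List[List[float]]:
    """Append log-F0 as one extra feature to every mel frame (hypothetical layout)."""
    if len(mel_frames) != len(f0):
        raise ValueError("mel_frames and f0 must have the same number of frames")
    f0_interp = interpolate_f0(f0)
    # max(..., 1.0) guards against log(0) when the whole input is unvoiced.
    return [list(frame) + [math.log(max(hz, 1.0))]
            for frame, hz in zip(mel_frames, f0_interp)]


if __name__ == "__main__":
    f0 = [0.0, 220.0, 0.0, 0.0, 260.0, 0.0]
    print(interpolate_f0(f0))
    feats = build_vocoder_input([[0.0] * 80 for _ in f0], f0)
    print(len(feats), len(feats[0]))  # 6 81
```

Interpolating through unvoiced regions gives the vocoder a continuous pitch contour instead of abrupt drops to zero. Whether the paper feeds log-F0, raw Hz, or another normalization is not stated on this page.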

Findings

01

Produces high-quality singing voices efficiently

02

Outperforms the authors' previous autoregressive system

03

Supports multi-singer synthesis with diverse timbres

Abstract

This paper introduces a robust singing voice synthesis (SVS) system that efficiently produces natural, realistic singing voices by leveraging adversarial training. First, we design simple but generic random area conditional discriminators to help supervise the acoustic model; they effectively avoid over-smoothed spectrogram predictions and improve the expressiveness of SVS. Second, we combine the spectrogram with the frame-level, linearly interpolated F0 sequence as the input to the neural vocoder, which is then optimized with multiple adversarial conditional discriminators in the waveform domain and multi-scale distance functions in the frequency domain. Experimental results and ablation studies show that, compared with our previous autoregressive work, our new system can produce high-quality singing…
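The abstract does not spell out how a random area conditional discriminator chooses its inputs. One plausible reading, sketched below under that assumption, is to crop the same randomly sized and positioned time-frequency window from the ground-truth and predicted spectrograms and hand each pair to a discriminator together with a conditioning vector (for example, a singer embedding). Only the area sampling and cropping are shown; the discriminator network itself is omitted, and `Area`, `sample_areas`, `crop`, and `discriminator_inputs` are hypothetical names.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Time-major spectrogram: spec[t][f] (hypothetical layout for this sketch).
Spectrogram = List[List[float]]


@dataclass(frozen=True)
class Area:
    """A rectangular time-frequency window (hypothetical helper type)."""
    t_start: int
    t_len: int
    f_start: int
    f_len: int


def sample_areas(n_frames: int, n_bins: int, num_areas: int,
                 min_frames: int, min_bins: int,
                 rng: random.Random) -> List[Area]:
    """Draw windows of random size and position that fit inside the spectrogram."""
    if n_frames <= 0 or n_bins <= 0:
        raise ValueError("spectrogram must be non-empty")
    areas = []
    for _ in range(num_areas):
        # randint is inclusive on both ends; clamp the minimum to [1, size].
        t_len = rng.randint(max(1, min(min_frames, n_frames)), n_frames)
        f_len = rng.randint(max(1, min(min_bins, n_bins)), n_bins)
        t_start = rng.randint(0, n_frames - t_len)
        f_start = rng.randint(0, n_bins - f_len)
        areas.append(Area(t_start, t_len, f_start, f_len))
    return areas


def crop(spec: Spectrogram, area: Area) -> Spectrogram:
    """Cut one window out of a time-major spectrogram."""
    rows = spec[area.t_start:area.t_start + area.t_len]
    return [row[area.f_start:area.f_start + area.f_len] for row in rows]


def discriminator_inputs(real: Spectrogram, fake: Spectrogram,
                         condition: Sequence[float], areas: List[Area]
                         ) -> List[Tuple[Spectrogram, Spectrogram, Sequence[float]]]:
    """Pair same-location crops of real and predicted spectrograms with a condition.

    In a full model each tuple would be scored by a conditional discriminator
    (not shown); `condition` stands in for, e.g., a singer embedding.
    """
    return [(crop(real, a), crop(fake, a), condition) for a in areas]


if __name__ == "__main__":
    rng = random.Random(0)
    real = [[float(t + f) for f in range(80)] for t in range(100)]
    fake = [[0.0] * 80 for _ in range(100)]
    areas = sample_areas(100, 80, num_areas=4, min_frames=16, min_bins=20, rng=rng)
    for r, g, cond in discriminator_inputs(real, fake, [0.1, 0.2], areas):
        print(len(r), len(r[0]), len(g), len(g[0]))
```

Judging randomly chosen sub-regions rather than only the whole spectrogram is one way to push the acoustic model toward sharp local detail, which is consistent with the abstract's stated aim of avoiding over-smoothed predictions.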

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis