WeSinger: Data-augmented Singing Voice Synthesis with Auxiliary Losses

Zewang Zhang; Yibin Zheng; Xinhui Li; Li Lu

arXiv:2203.10750 · cs.SD · June 28, 2022

Open Access

TL;DR

WeSinger is a multi-singer Chinese neural singing voice synthesis (SVS) system that combines data augmentation with multi-singer pre-training, an LSTM-based duration model, a Transformer-like acoustic model, and a 24 kHz LPCNet vocoder to improve the naturalness and accuracy of synthesized singing.

Contribution

It introduces a complete SVS system that combines a 24 kHz pitch-aware LPCNet vocoder, multi-singer pre-training, and new duration and acoustic modeling modules for improved singing synthesis.

Findings

01

Achieves state-of-the-art performance on the Opencpop corpus

02

Demonstrates high naturalness and accuracy in synthesis

03

To the authors' knowledge, the first SVS system to combine a 24 kHz LPCNet vocoder with multi-singer pre-training

Abstract

In this paper, we develop a new multi-singer Chinese neural singing voice synthesis (SVS) system named WeSinger. To improve the accuracy and naturalness of the synthesized singing voice, we design several specific modules and techniques: 1) a deep bi-directional LSTM-based duration model with a multi-scale rhythm loss and a post-processing step; 2) a Transformer-like acoustic model with a progressive pitch-weighted decoder loss; 3) a 24 kHz pitch-aware LPCNet neural vocoder to produce high-quality singing waveforms; 4) a novel data augmentation method with multi-singer pre-training for stronger robustness and naturalness. To our knowledge, WeSinger is the first SVS system to adopt 24 kHz LPCNet and multi-singer pre-training simultaneously. Both quantitative and qualitative evaluation results demonstrate the effectiveness of WeSinger in terms of accuracy and naturalness, and WeSinger achieves…

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory