XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System

Peiling Lu; Jie Wu; Jian Luan; Xu Tan; Li Zhou

arXiv:2006.06261·eess.AS·June 12, 2020·20 cites

TL;DR

XiaoiceSing is an integrated singing voice synthesis system that improves sound quality, pronunciation accuracy, and naturalness by incorporating musical score features and specialized modeling techniques.

Contribution

It introduces singing-specific design enhancements to the FastSpeech architecture, including musical score features as input and a residual connection in F0 prediction, to improve singing voice synthesis.

Findings

01 XiaoiceSing outperforms a convolutional neural network baseline by 1.44 MOS on sound quality.

02 In A/B tests, the proposed F0 modeling achieves a 97.3% preference rate and the proposed duration modeling achieves 84.3%.

03 It also improves on the same baseline by 1.18 MOS on pronunciation accuracy and 1.38 MOS on naturalness.

Abstract

This paper presents XiaoiceSing, a high-quality singing voice synthesis system that employs an integrated network for spectrum, F0, and duration modeling. We follow the main architecture of FastSpeech while proposing several singing-specific designs: 1) Besides phoneme ID and position encoding, features from the musical score (e.g., note pitch and length) are also added. 2) To attenuate off-key issues, we add a residual connection in F0 prediction. 3) In addition to the duration loss of each phoneme, the durations of all the phonemes in a musical note are accumulated to calculate a syllable duration loss for rhythm enhancement. Experimental results show that XiaoiceSing outperforms the convolutional neural network baseline by 1.44 MOS on sound quality, 1.18 on pronunciation accuracy, and 1.38 on naturalness, respectively. In two A/B tests, the proposed F0 and duration modeling methods…
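Designs 2 and 3 in the abstract can be illustrated with a short sketch. It is not the authors' implementation: the function names are hypothetical, the MIDI-to-Hz conversion and the log-F0 domain are assumptions about how the note pitch enters the residual connection, and mean squared error is an assumed choice of loss.

```python
import math


def residual_log_f0(note_pitch_midi, predicted_residual):
    """Add a predicted residual to the score's note pitch (assumed log-F0 domain)."""
    out = []
    for midi, res in zip(note_pitch_midi, predicted_residual):
        # Standard MIDI-to-Hz conversion (A4 = MIDI 69 = 440 Hz).
        note_hz = 440.0 * 2.0 ** ((midi - 69) / 12.0)
        # The network only has to model the deviation from the written note.
        out.append(math.log(note_hz) + res)
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences (assumed loss)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def syllable_durations(phoneme_dur, note_index, n_notes):
    """Accumulate phoneme durations over the musical note each phoneme belongs to."""
    totals = [0.0] * n_notes
    for dur, note in zip(phoneme_dur, note_index):
        totals[note] += dur
    return totals


def duration_losses(pred_dur, true_dur, note_index):
    """Return (phoneme-level loss, syllable-level loss)."""
    n_notes = max(note_index) + 1
    phoneme_loss = mse(pred_dur, true_dur)
    syllable_loss = mse(
        syllable_durations(pred_dur, note_index, n_notes),
        syllable_durations(true_dur, note_index, n_notes),
    )
    return phoneme_loss, syllable_loss


# Two phonemes share note 0; a third phoneme sits on note 1 (durations in frames).
phoneme_loss, syllable_loss = duration_losses([3, 2, 4], [2, 3, 4], [0, 0, 1])
```

In this toy example the two phonemes in the first note are each off by one frame, so the phoneme-level loss is nonzero, but their errors cancel within the note and the syllable-level loss is zero. The syllable term therefore penalizes only errors that shift note boundaries, which is the rhythm-related quantity the abstract describes.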

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Residual Connection