XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System
Peiling Lu, Jie Wu, Jian Luan, Xu Tan, Li Zhou

TL;DR
XiaoiceSing is an integrated singing voice synthesis system that improves sound quality, pronunciation accuracy, and naturalness by incorporating musical score features and specialized modeling techniques.
Contribution
It introduces singing-specific design enhancements to the FastSpeech architecture, including musical score features and residual F0 prediction, for superior singing voice synthesis.
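One way to read the residual F0 idea is that the network predicts only a deviation from the pitch written in the score, so the output stays anchored to the intended note. The sketch below is a minimal illustration under assumptions not stated in this summary: note pitch is given as a MIDI number, F0 is handled in the log-Hz domain, and `midi_to_log_f0` and `apply_f0_residual` are hypothetical helper names rather than the authors' code.

```python
import math


def midi_to_log_f0(midi_note: float) -> float:
    """Convert a MIDI note number to natural-log frequency in Hz (A4 = 69 = 440 Hz)."""
    return math.log(440.0 * 2.0 ** ((midi_note - 69.0) / 12.0))


def apply_f0_residual(note_pitches, predicted_residuals):
    """Add per-frame residuals to the score pitch to obtain the final log-F0 contour.

    note_pitches:        per-frame MIDI note numbers taken from the musical score.
    predicted_residuals: per-frame log-F0 offsets, standing in for a network
                         output (assumption: the model predicts a residual only).
    """
    if len(note_pitches) != len(predicted_residuals):
        raise ValueError("note_pitches and predicted_residuals must have equal length")
    return [
        midi_to_log_f0(pitch) + residual
        for pitch, residual in zip(note_pitches, predicted_residuals)
    ]


# Two frames on A4; the second frame is nudged slightly sharp by the residual.
log_f0 = apply_f0_residual([69, 69], [0.0, 0.01])
f0_hz = [math.exp(value) for value in log_f0]
```

With a zero residual the contour falls back to the score pitch (440 Hz for MIDI note 69), which is the property that would help keep the output from drifting off-key.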
Findings
XiaoiceSing outperforms the CNN baseline system by 1.44 MOS in sound quality.
The proposed F0 and duration modeling methods achieve 97.3% and 84.3% preference, respectively, in two A/B tests.
It also improves on the same baseline by 1.18 MOS in pronunciation accuracy and 1.38 MOS in naturalness.
Abstract
This paper presents XiaoiceSing, a high-quality singing voice synthesis system which employs an integrated network for spectrum, F0 and duration modeling. We follow the main architecture of FastSpeech while proposing several singing-specific designs: 1) Besides phoneme ID and position encoding, features from the musical score (e.g., note pitch and length) are also added. 2) To attenuate off-key issues, we add a residual connection in F0 prediction. 3) In addition to the duration loss of each phoneme, the durations of all the phonemes in a musical note are accumulated to calculate the syllable duration loss for rhythm enhancement. Experimental results show that XiaoiceSing outperforms the baseline system of convolutional neural networks by 1.44 MOS on sound quality, 1.18 on pronunciation accuracy and 1.38 on naturalness. In two A/B tests, the proposed F0 and duration modeling methods…
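The third design point, a syllable duration loss accumulated over the phonemes of each note, can be sketched as follows. This is a minimal illustration under stated assumptions: mean squared error is used for both terms, the two terms are combined with the hypothetical weights `w_phoneme` and `w_syllable`, and `note_ids` is an assumed per-phoneme index of the note each phoneme belongs to. None of these details are given in the abstract.

```python
from collections import defaultdict


def mse(predicted, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def sum_by_note(durations, note_ids):
    """Accumulate phoneme durations into one total per musical note."""
    totals = defaultdict(float)
    for duration, note_id in zip(durations, note_ids):
        totals[note_id] += duration
    return [totals[note_id] for note_id in sorted(totals)]


def duration_loss(pred, target, note_ids, w_phoneme=1.0, w_syllable=1.0):
    """Phoneme-level loss plus a syllable-level loss on per-note duration sums.

    The weights are illustrative assumptions, not values from the paper.
    """
    phoneme_term = mse(pred, target)
    syllable_term = mse(sum_by_note(pred, note_ids), sum_by_note(target, note_ids))
    return w_phoneme * phoneme_term + w_syllable * syllable_term


# Three phonemes (durations in frames): the first two share note 0, the third is note 1.
loss = duration_loss(pred=[2, 3, 3], target=[3, 3, 3], note_ids=[0, 0, 1])
```

The syllable term penalizes errors that change the total length of a note, while errors that cancel out inside a note (for example predicting 2 and 4 frames against a target of 3 and 3) affect only the phoneme term. That is one plausible reading of why the extra term helps rhythm.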
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Residual Connection
