Sequence-to-sequence Singing Voice Synthesis with Perceptual Entropy Loss
Jiatong Shi, Shuai Guo, Nan Huo, Yuekai Zhang, Qin Jin

TL;DR
This paper introduces a Perceptual Entropy (PE) loss, based on a psycho-acoustic hearing model, to regularize sequence-to-sequence singing voice synthesis (SVS) systems, reducing overfitting and improving synthesis quality when training data is limited.
Contribution
The paper proposes a PE loss function for SVS that improves model generalization and singing quality across several sequence-to-sequence architectures.
Findings
PE loss mitigates overfitting in SVS models
Significant improvement in synthesized singing quality
Effective across RNN, transformer, and conformer models
Abstract
Neural network (NN) based singing voice synthesis (SVS) systems require sufficient data to train well and are prone to over-fitting when data is scarce. However, we often encounter a data limitation problem in building SVS systems because of high data acquisition and annotation costs. In this work, we propose a Perceptual Entropy (PE) loss derived from a psycho-acoustic hearing model to regularize the network. With a one-hour open-source singing voice database, we explore the impact of the PE loss on various mainstream sequence-to-sequence models, including RNN-based, transformer-based, and conformer-based models. Our experiments show that the PE loss can mitigate the over-fitting problem and significantly improve the synthesized singing quality, as reflected in objective and subjective evaluations.
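To make the idea of a PE regularizer concrete, here is a minimal sketch in Python. It is not the paper's implementation: the abstract does not give the formula, so the simplified Johnston-style entropy below, the precomputed masking threshold, the L1 reconstruction term, and the names `perceptual_entropy`, `total_loss`, and `pe_weight` are all assumptions made for illustration.

```python
import numpy as np

def perceptual_entropy(power_spec, threshold, eps=1e-10):
    """Simplified perceptual-entropy-style measure, in bits per frame.

    ASSUMPTION: this log2(1 + sqrt(P / T)) form is a common simplification,
    not necessarily the formula used in the paper.

    power_spec: (frames, bins) power spectrum of the predicted signal.
    threshold:  (frames, bins) masking threshold, assumed to be supplied by
                a psycho-acoustic model that is not implemented here.
    """
    # Bins whose power is far below the masking threshold cost close to 0 bits.
    ratio = np.sqrt(power_spec / (threshold + eps))
    bits = np.log2(1.0 + ratio)
    return bits.sum(axis=-1)

def total_loss(pred_mag, target_mag, threshold, pe_weight=0.01):
    """L1 spectral loss plus a weighted PE term (hypothetical combination)."""
    recon = np.abs(pred_mag - target_mag).mean()
    pe = perceptual_entropy(pred_mag ** 2, threshold).mean()
    return recon + pe_weight * pe
```

In this sketch the PE term penalizes predicted spectral energy that exceeds the masking threshold, and `pe_weight` trades it off against the reconstruction error. In an actual training setup the same arithmetic would be written in an autodiff framework so that gradients flow through the PE term.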
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
