Sequence-to-sequence Singing Voice Synthesis with Perceptual Entropy Loss

Jiatong Shi; Shuai Guo; Nan Huo; Yuekai Zhang; Qin Jin

arXiv:2010.12024 · eess.AS · March 1, 2021

TL;DR

This paper introduces a Perceptual Entropy loss based on psycho-acoustic models to regularize sequence-to-sequence singing voice synthesis systems, effectively reducing overfitting and enhancing synthesis quality with limited data.

Contribution

It proposes a novel PE loss function for SVS that improves model generalization and singing quality across various sequence-to-sequence architectures.

Findings

01. PE loss mitigates overfitting in SVS models

02. Significant improvement in synthesized singing quality

03. Effective across RNN, transformer, and conformer models

Abstract

Neural network (NN) based singing voice synthesis (SVS) systems require sufficient data to train well and are prone to over-fitting when data is scarce. However, data limitation is a common problem in building SVS systems because of high data acquisition and annotation costs. In this work, we propose a Perceptual Entropy (PE) loss derived from a psycho-acoustic hearing model to regularize the network. With a one-hour open-source singing voice database, we explore the impact of the PE loss on various mainstream sequence-to-sequence models, including RNN-based, transformer-based, and conformer-based models. Our experiments show that the PE loss can mitigate the over-fitting problem and significantly improve the synthesized singing quality, as reflected in objective and subjective evaluations.
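As a rough illustration of how a perceptual-entropy term could act as a regularizer alongside a reconstruction loss, here is a minimal sketch. It is not the paper's formulation. The per-bin form log2(1 + sqrt(P / T)), the function names, the `pe_weight` value, and the assumption that masking thresholds are supplied as an input are all hypothetical simplifications. A real psycho-acoustic model would derive the thresholds from the signal itself.

```python
import math


def perceptual_entropy(power, threshold, eps=1e-10):
    """Simplified per-frame PE in bits per bin (assumed form, not the paper's).

    power:     non-negative spectral power per frequency bin
    threshold: masking threshold per bin (assumed to be given)
    """
    bits = [
        math.log2(1.0 + math.sqrt(p / max(t, eps)))
        for p, t in zip(power, threshold)
    ]
    return sum(bits) / len(bits)


def l1_loss(pred, target):
    """Mean absolute error between predicted and target spectra."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def total_loss(pred_power, target_power, threshold, pe_weight=0.01):
    """Reconstruction loss plus a weighted PE term on the prediction.

    pe_weight is a hypothetical hyperparameter chosen for illustration.
    """
    recon = l1_loss(pred_power, target_power)
    pe = perceptual_entropy(pred_power, threshold)
    return recon + pe_weight * pe
```

In this sketch, bins whose power sits far above the masking threshold contribute more bits, so the added term penalizes predicted spectral energy in proportion to how audible it would be under the assumed thresholds.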

Code & Models

Repositories

SJTMusicTeam/SVS_system (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing