Pitch Preservation In Singing Voice Synthesis
Shujun Liu, Hai Zhu, Kun Wang, Huajun Wang

TL;DR
This paper introduces a novel singing voice synthesis model that disentangles pitch and phoneme information to improve pitch accuracy and synthesis quality, especially with limited training data.
Contribution
It proposes an acoustic model with independent pitch and phoneme encoders constrained by specific loss functions, enhancing the utilization of sparse data in singing voice synthesis.
Findings
Improved pitch synthesis accuracy.
Superior singing synthesis performance.
Effective disentanglement of pitch and phoneme features.
Abstract
Because singing voice corpora are limited, existing singing voice synthesis (SVS) methods that build encoder-decoder neural networks to generate spectrograms directly can produce out-of-tune output at inference time. To mitigate this, the paper presents a novel acoustic model with independent pitch and phoneme encoders, which disentangles the phoneme and pitch information in the music score to make full use of the corpus. Specifically, following equal temperament theory, the pitch encoder is constrained by a pitch metric loss that maps distances between adjacent input pitches to corresponding frequency multiples between the encoder outputs. For the phoneme encoder, based on the observation that the same phoneme sung at different pitches is pronounced similarly, the encoder is followed by an adversarially trained pitch classifier to enforce the…
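The equal temperament relation that the abstract refers to can be sketched as follows. In twelve-tone equal temperament, a distance of d semitones corresponds to a frequency multiple of 2^(d/12). The snippet is only an illustration under stated assumptions: it uses MIDI note numbers with A4 = 440 Hz as the reference, and `toy_ratio_loss` is a hypothetical scalar stand-in for a "pitch metric" constraint, not the loss defined in the paper.

```python
import math

A4_MIDI = 69      # MIDI note number of A4 (assumed reference pitch)
A4_HZ = 440.0     # common concert-pitch tuning (assumption)


def midi_to_hz(note: int) -> float:
    """Frequency of a MIDI note under twelve-tone equal temperament."""
    return A4_HZ * 2.0 ** ((note - A4_MIDI) / 12.0)


def frequency_ratio(note_a: int, note_b: int) -> float:
    """Frequency multiple between two notes: 2 ** (semitone distance / 12)."""
    return 2.0 ** ((note_b - note_a) / 12.0)


def toy_ratio_loss(out_a: float, out_b: float, note_a: int, note_b: int) -> float:
    """Hypothetical illustration, not the paper's loss.

    Treats each encoder output as a positive scalar and penalizes the squared
    difference between the log2 ratio of the outputs and the log2 frequency
    ratio implied by the semitone distance.
    """
    target = (note_b - note_a) / 12.0
    return (math.log2(out_b / out_a) - target) ** 2


if __name__ == "__main__":
    print(midi_to_hz(69))            # A4
    print(frequency_ratio(60, 61))   # one semitone up
    print(toy_ratio_loss(1.0, 2.0, 60, 72))  # outputs one octave apart
```

Outputs whose ratio matches the equal temperament multiple give a loss of zero in this toy version; any real encoder produces vectors rather than scalars, so the paper's actual formulation will differ.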
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
