Pitch Preservation In Singing Voice Synthesis

Shujun Liu; Hai Zhu; Kun Wang; Huajun Wang

arXiv:2110.05033 · cs.SD · October 13, 2021

TL;DR

This paper introduces a novel singing voice synthesis model that disentangles pitch and phoneme information to improve pitch accuracy and synthesis quality, especially with limited training data.

Contribution

It proposes an acoustic model with independent pitch and phoneme encoders constrained by specific loss functions, enhancing the utilization of sparse data in singing voice synthesis.

Findings

1. Improved pitch synthesis accuracy.
2. Superior singing synthesis performance.
3. Effective disentanglement of pitch and phoneme features.

Abstract

Because singing voice corpora are limited, existing singing voice synthesis (SVS) methods that build encoder-decoder neural networks to generate spectrograms directly can produce out-of-tune output at inference time. To mitigate this, the paper presents a novel acoustic model with an independent pitch encoder and phoneme encoder, which disentangles the phoneme and pitch information in the music score to make fuller use of the corpus. Specifically, following equal temperament theory, the pitch encoder is constrained by a pitch metric loss that maps distances between adjacent input pitches to the corresponding frequency multiples between the encoder outputs. For the phoneme encoder, based on the analysis that the same phoneme sung at different pitches produces similar pronunciations, this encoder is followed by an adversarially trained pitch classifier to enforce the…
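The equal temperament relationship that the abstract refers to can be illustrated with a small sketch. This is not the paper's pitch metric loss, whose exact form is not given here; it only shows the underlying fact that a distance of n semitones corresponds to a frequency multiple of 2^(n/12). The MIDI note numbering, the A4 = 440 Hz reference, and the function names are assumptions made for illustration.

```python
# Illustrative only: equal temperament maps semitone distances to
# frequency multiples. Names and the A4 = 440 Hz reference are assumptions,
# not taken from the paper.

A4_MIDI = 69      # assumed MIDI note number for A4
A4_HZ = 440.0     # assumed reference tuning


def midi_to_hz(note: int) -> float:
    """Frequency of a MIDI note under 12-tone equal temperament."""
    return A4_HZ * 2.0 ** ((note - A4_MIDI) / 12.0)


def frequency_ratio(note_a: int, note_b: int) -> float:
    """Frequency multiple between two notes: 2 ** (semitone distance / 12)."""
    return 2.0 ** ((note_b - note_a) / 12.0)


if __name__ == "__main__":
    # Semitone, perfect fifth, and octave above middle C (MIDI 60).
    for a, b in [(60, 61), (60, 67), (60, 72)]:
        print(f"{a} -> {b}: x{frequency_ratio(a, b):.4f} "
              f"({midi_to_hz(a):.2f} Hz -> {midi_to_hz(b):.2f} Hz)")
```

The ratio depends only on the semitone distance, not on the absolute pitch, which is presumably the property that lets a metric on note distances be tied to frequency multiples.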

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing