Disentangling Pitch and Creak for Speaker Identity Preservation in Speech Synthesis
Frederik Rautenberg, Jana Wiechmann, Petra Wagner, Reinhold Haeb-Umbach

TL;DR
This paper presents a speech synthesis system that modifies creak, a perceptual voice quality, while preserving the speaker's perceived identity. Pitch is disentangled from creak by augmenting the training data with a speaker manipulation block based on a conditional continuous normalizing flow.
Contribution
It introduces a method for disentangling pitch from creak in a speech synthesis system, so that creak can be manipulated while the speaker's perceived identity is preserved.
Findings
Greatly improved speaker verification performance over a range of creak manipulation strengths.
Effective disentanglement of pitch and creak in speech synthesis.
Demonstrated robustness of the method in preserving speaker identity.
Abstract
We introduce a system capable of faithfully modifying the perceptual voice quality of creak while preserving the speaker's perceived identity. While it is well known that high creak probability is typically correlated with low pitch, it is important to note that this is a property observed on a population of speakers but does not necessarily hold across all situations. Disentanglement of pitch from creak is achieved by augmentation of the training dataset of a speech synthesis system with a speaker manipulation block based on conditional continuous normalizing flow. The experiments show greatly improved speaker verification performance over a range of creak manipulation strengths.
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders
