Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation

Ha-Yeong Choi; Sang-Hoon Lee; Seong-Whan Lee

arXiv:2311.04693 · eess.AS · November 9, 2023 · 1 citation

TL;DR

Diff-HierVC introduces a hierarchical diffusion-based voice conversion system that enhances pitch accuracy and zero-shot speaker adaptation by using a novel pitch generator, a disentangled speech representation, and masked priors.

Contribution

The paper proposes a novel hierarchical diffusion model for voice conversion that improves pitch accuracy and zero-shot speaker adaptation capabilities.

Findings

01. Superior pitch generation and voice style transfer performance.

02. Achieves a character error rate (CER) of 0.83% and an equal error rate (EER) of 3.29% in zero-shot scenarios.

03. Effective disentanglement of speech features enhances conversion quality.
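
For readers unfamiliar with the two metrics in finding 02, the following is a minimal sketch of how CER and EER are commonly computed. It is not the paper's evaluation script: the function names, the threshold sweep used for EER, and the toy inputs are all assumptions made for illustration. CER is the character-level edit distance between a reference transcript and a recognized transcript, divided by the reference length. EER is the operating point of a speaker verification system where the false acceptance rate equals the false rejection rate.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def eer(genuine_scores, impostor_scores):
    """Approximate equal error rate via a sweep over the observed scores.

    A trial is accepted when its score is >= the threshold. The result is the
    mean of the false acceptance and false rejection rates at the threshold
    where the two are closest (an assumed, simple approximation).
    """
    best_gap, best_rate = None, None
    for t in sorted(set(genuine_scores) | set(impostor_scores)):
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (far + frr) / 2
    return best_rate


if __name__ == "__main__":
    # Toy inputs, not data from the paper.
    print(cer("kitten", "sitting"))
    print(eer([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1]))
```

Lower is better for both metrics: a low CER suggests that the converted speech remains intelligible to a speech recognizer, and a low EER suggests that a verification model accepts the converted speech as the target speaker.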

Abstract

Although voice conversion (VC) systems have shown a remarkable ability to transfer voice style, existing methods still suffer from inaccurate pitch and low speaker adaptation quality. To address these challenges, we introduce Diff-HierVC, a hierarchical VC system based on two diffusion models. We first introduce DiffPitch, which can effectively generate F0 with the target voice style. Subsequently, the generated F0 is fed to DiffVoice to convert the speech with a target voice style. Furthermore, using the source-filter encoder, we disentangle the speech and use the converted Mel-spectrogram as a data-driven prior in DiffVoice to improve the voice style transfer capacity. Finally, by using the masked prior in diffusion models, our model can improve the speaker adaptation quality. Experimental results verify the superiority of our model in pitch generation and voice style transfer…
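
The order of the stages described in the abstract can be sketched as a data flow. The code below is only an illustration with placeholder arithmetic. None of the functions are the authors' models or the API of the official repository. The abstract does not specify the masking scheme, so the random frame masking is an assumption. So are the inputs to each stage and the `style` dictionary.

```python
import random


def diff_pitch(source_f0, style):
    # Placeholder for DiffPitch. The real component is a diffusion model that
    # generates F0 in the target voice style; here voiced frames are shifted.
    return [f0 + style["f0_shift"] if f0 > 0 else 0.0 for f0 in source_f0]


def source_filter_encoder(content, f0, style):
    # Placeholder for the source-filter encoder: combines per-frame content,
    # pitch, and style into a coarse "converted Mel" used as the prior.
    return [[c + style["timbre"] + f / 1000.0 for c in frame]
            for frame, f in zip(content, f0)]


def mask_prior(prior, ratio, rng):
    # Assumed masking scheme: zero out a random subset of prior frames.
    n_masked = int(len(prior) * ratio)
    masked = set(rng.sample(range(len(prior)), n_masked))
    return [[0.0] * len(frame) if i in masked else list(frame)
            for i, frame in enumerate(prior)]


def diff_voice(prior, f0, style):
    # Placeholder for DiffVoice. The real component would refine the prior by
    # reverse diffusion conditioned on F0 and style; here it is the identity.
    return [list(frame) for frame in prior]


def convert(content, source_f0, style, mask_ratio=0.0, seed=0):
    """Run the hypothetical pipeline: pitch first, then voice."""
    rng = random.Random(seed)
    f0 = diff_pitch(source_f0, style)
    prior = source_filter_encoder(content, f0, style)
    prior = mask_prior(prior, mask_ratio, rng)
    return f0, diff_voice(prior, f0, style)


if __name__ == "__main__":
    content = [[0.0, 0.0] for _ in range(4)]       # toy content features
    style = {"f0_shift": 50.0, "timbre": 1.0}      # toy target style
    f0, mel = convert(content, [100.0, 0.0, 120.0, 110.0], style, mask_ratio=0.5)
    print(f0)
    print(mel)
```

The point of the sketch is the hierarchy: pitch is generated first and then conditions the spectrogram stage, which starts from a data-driven prior rather than from pure noise.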

Code & Models

Repositories

hayeong0/Diff-HierVC
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Diffusion