CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion

Yuke Li; Xinfa Zhu; Hanzhao Li; JiXun Yao; WenJie Tian; XiPeng Yang; YunLin Chen; Zhifei Li; Lei Xie

arXiv:2411.18918·cs.SD·December 4, 2024

TL;DR

CoDiff-VC is an end-to-end zero-shot voice conversion framework that combines a speech codec with a diffusion model, improving speech naturalness and speaker similarity without relying on pre-trained recognition models.

Contribution

The paper introduces CoDiff-VC, a novel framework integrating a speech codec and diffusion model with new techniques for content disentanglement and timbre modeling, advancing zero-shot voice conversion.

Findings

01 Improves speaker similarity in zero-shot VC

02 Generates more natural and high-quality speech

03 Outperforms existing methods in objective and subjective tests

Abstract

Zero-shot voice conversion (VC) aims to convert the original speaker's timbre to any target speaker while keeping the linguistic content. Current mainstream zero-shot voice conversion approaches depend on pre-trained recognition models to disentangle linguistic content and speaker representation. This results in a timbre residue within the decoupled linguistic content and inadequacies in speaker representation modeling. In this study, we propose CoDiff-VC, an end-to-end framework for zero-shot voice conversion that integrates a speech codec and a diffusion model to produce high-fidelity waveforms. Our approach involves employing a single-codebook codec to separate linguistic content from the source speech. To enhance content disentanglement, we introduce Mix-Style layer normalization (MSLN) to perturb the original timbre. Additionally, we incorporate a multi-scale speaker timbre…
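The abstract names Mix-Style layer normalization (MSLN) but does not give its form. The sketch below is only an illustration under an assumed formulation: each utterance's layer-norm scale and shift are interpolated with those of another, randomly chosen utterance in the batch, so the style statistics attached to the content features are perturbed. The function names, the `alpha` parameter, and the Beta-distributed mixing weight are all assumptions for illustration, not details confirmed by this page.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def mix_style_layer_norm(features, gammas, betas, alpha=0.2, rng=None):
    """Hypothetical mix-style layer norm over a batch (assumed formulation).

    features: list of per-utterance feature vectors.
    gammas, betas: per-utterance scale and shift vectors, assumed to be
        predicted elsewhere from a style or speaker embedding.
    alpha: assumed Beta(alpha, alpha) parameter for the mixing weight.
    """
    rng = rng or random.Random()
    n = len(features)

    # Pair each utterance with another one from the shuffled batch.
    perm = list(range(n))
    rng.shuffle(perm)

    # One mixing weight for the whole batch, drawn from Beta(alpha, alpha).
    lam = rng.betavariate(alpha, alpha)

    out = []
    for i in range(n):
        j = perm[i]
        # Interpolate this utterance's scale/shift with its partner's.
        gamma = [lam * a + (1 - lam) * b for a, b in zip(gammas[i], gammas[j])]
        beta = [lam * a + (1 - lam) * b for a, b in zip(betas[i], betas[j])]
        normed = layer_norm(features[i])
        out.append([g * v + b for g, v, b in zip(gamma, normed, beta)])
    return out
```

When every utterance in the batch shares the same scale and shift, the mixing has no effect and the function reduces to ordinary conditional layer normalization; the perturbation only appears when the batch contains different styles.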

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing