CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion
Yuke Li, Xinfa Zhu, Hanzhao Li, JiXun Yao, WenJie Tian, XiPeng Yang, YunLin Chen, Zhifei Li, Lei Xie

TL;DR
CoDiff-VC is an innovative end-to-end zero-shot voice conversion framework that combines speech codecs and diffusion models, significantly enhancing speech naturalness and speaker similarity without relying on pre-trained recognition models.
Contribution
The paper introduces CoDiff-VC, a novel framework integrating a speech codec and diffusion model with new techniques for content disentanglement and timbre modeling, advancing zero-shot voice conversion.
Findings
Improves speaker similarity in zero-shot VC
Generates more natural and high-quality speech
Outperforms existing methods in objective and subjective tests
Abstract
Zero-shot voice conversion (VC) aims to convert the original speaker's timbre to any target speaker while keeping the linguistic content. Current mainstream zero-shot voice conversion approaches depend on pre-trained recognition models to disentangle linguistic content and speaker representation. This results in a timbre residue within the decoupled linguistic content and inadequacies in speaker representation modeling. In this study, we propose CoDiff-VC, an end-to-end framework for zero-shot voice conversion that integrates a speech codec and a diffusion model to produce high-fidelity waveforms. Our approach involves employing a single-codebook codec to separate linguistic content from the source speech. To enhance content disentanglement, we introduce Mix-Style layer normalization (MSLN) to perturb the original timbre. Additionally, we incorporate a multi-scale speaker timbre…
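As a rough illustration of the timbre-perturbation idea, the sketch below follows a commonly described formulation of mix-style layer normalization: the style-conditioned scale and bias are interpolated with a batch-shuffled copy of themselves using a Beta-sampled weight. The abstract does not give the details of MSLN, so the function name, the linear projections `w_gamma` and `w_beta`, the `alpha` value, and the tensor shapes are all assumptions for illustration, not the CoDiff-VC implementation.

```python
import numpy as np

def mix_style_layer_norm(x, style, w_gamma, w_beta, alpha=0.2, rng=None, eps=1e-5):
    """Hypothetical mix-style layer norm sketch (not the paper's code).

    x:       (batch, time, dim) content features
    style:   (batch, style_dim) speaker/style embeddings
    w_gamma: (style_dim, dim) assumed projection producing the scale
    w_beta:  (style_dim, dim) assumed projection producing the bias
    """
    rng = np.random.default_rng() if rng is None else rng
    batch = x.shape[0]

    # Plain layer normalization over the feature axis, without a learned affine.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mean) / np.sqrt(var + eps)

    # Style-conditioned scale and bias, one vector per utterance.
    gamma = style @ w_gamma
    beta = style @ w_beta

    # Perturb the style: mix each utterance's scale/bias with those of
    # another utterance from a shuffled batch, weighted by lam ~ Beta(alpha, alpha).
    perm = rng.permutation(batch)
    lam = rng.beta(alpha, alpha, size=(batch, 1))
    gamma_mix = lam * gamma + (1.0 - lam) * gamma[perm]
    beta_mix = lam * beta + (1.0 - lam) * beta[perm]

    # Broadcast the per-utterance scale and bias over the time axis.
    return gamma_mix[:, None, :] * x_norm + beta_mix[:, None, :]
```

In this sketch, when every utterance in the batch shares the same style embedding the mixing has no effect, so the perturbation only acts when the batch contains different speakers.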
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
