QR-VC: Leveraging Quantization Residuals for Linear Disentanglement in Zero-Shot Voice Conversion
Youngjun Sim, Jinsung Yoon, Wooyeol Jeong, Young-Joo Suh

TL;DR
This paper introduces QR-VC, a zero-shot voice conversion method that uses quantization residuals and linear disentanglement to improve speech quality, intelligibility, and speaker similarity without relying on complex models.
Contribution
The paper proposes an approach that applies linear projections to quantization residuals to disentangle speech components in zero-shot voice conversion.
Findings
Outperforms existing methods on both subjective and objective metrics.
Achieves higher intelligibility and speaker similarity.
Improves prosody preservation.
Abstract
Zero-shot voice conversion is a technique that alters the speaker identity of an input speech to match a target speaker using only a single reference utterance, without requiring additional training. Recent approaches extensively utilize self-supervised learning features with K-means quantization to extract high-quality content representations while removing speaker identity. However, this quantization process also eliminates fine-grained phonetic and prosodic variations, degrading intelligibility and prosody preservation. While prior works have primarily focused on quantized representations, quantization residuals remain underutilized and deserve further exploration. In this paper, we introduce a novel approach that fully utilizes quantization residuals by leveraging temporal properties of speech components. This facilitates the disentanglement of speaker identity and the recovery of…
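To make the term "quantization residual" concrete, here is a minimal sketch, not the authors' implementation. It assumes a codebook of centroids has already been fit with K-means on self-supervised features; the function names, the toy two-dimensional vectors, and the codebook values are all hypothetical. Each frame is assigned to its nearest centroid (the quantized representation), and the residual is whatever the centroid fails to capture. How QR-VC then processes the residual is not shown here.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize_with_residuals(
    frames: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[Vector], List[Vector]]:
    """Assign each frame to its nearest centroid and keep the leftover.

    Returns (indices, quantized, residuals) where, for every frame,
    frame == quantized + residual up to floating-point error.
    """
    indices: List[int] = []
    quantized: List[Vector] = []
    residuals: List[Vector] = []
    for frame in frames:
        # Nearest-centroid lookup, as in K-means assignment.
        idx = min(
            range(len(codebook)),
            key=lambda k: squared_distance(frame, codebook[k]),
        )
        centroid = codebook[idx]
        indices.append(idx)
        quantized.append(list(centroid))
        # The residual holds the detail that quantization discards.
        residuals.append([x - c for x, c in zip(frame, centroid)])
    return indices, quantized, residuals


# Toy example: hypothetical 2-D "features" and a 2-entry codebook.
toy_codebook = [[0.0, 0.0], [1.0, 1.0]]
toy_frames = [[0.1, -0.2], [0.9, 1.2], [0.4, 0.3]]
toy_indices, toy_quantized, toy_residuals = quantize_with_residuals(
    toy_frames, toy_codebook
)
print(toy_indices)  # [0, 1, 0]
```

In this toy case the first and third frames map to the same centroid, so their quantized forms are identical; only the residuals distinguish them. That is the kind of fine-grained variation the abstract says is lost when only quantized representations are kept.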
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
