QR-VC: Leveraging Quantization Residuals for Linear Disentanglement in Zero-Shot Voice Conversion

Youngjun Sim; Jinsung Yoon; Wooyeol Jeong; Young-Joo Suh

arXiv:2411.16147 · cs.SD · September 11, 2025

TL;DR

This paper introduces QR-VC, a zero-shot voice conversion method that leverages quantization residuals and linear disentanglement to improve speech quality, intelligibility, and speaker similarity without complex models.

Contribution

The paper proposes a novel approach that uses quantization residuals with linear projections for effective disentanglement in zero-shot voice conversion.

Findings

01. Outperforms existing methods in subjective and objective metrics.

02. Achieves higher intelligibility and speaker similarity.

03. Improves prosody preservation.

Abstract

Zero-shot voice conversion is a technique that alters the speaker identity of an input speech to match a target speaker using only a single reference utterance, without requiring additional training. Recent approaches extensively utilize self-supervised learning features with K-means quantization to extract high-quality content representations while removing speaker identity. However, this quantization process also eliminates fine-grained phonetic and prosodic variations, degrading intelligibility and prosody preservation. While prior works have primarily focused on quantized representations, quantization residuals remain underutilized and deserve further exploration. In this paper, we introduce a novel approach that fully utilizes quantization residuals by leveraging temporal properties of speech components. This facilitates the disentanglement of speaker identity and the recovery of…
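
To make the term "quantization residual" concrete, here is a minimal sketch of nearest-centroid quantization, assuming frame-level features and a pre-computed K-means codebook. The function name `quantize_with_residual`, the array shapes, and the random stand-in data are illustrative assumptions and are not taken from the paper; the sketch only shows the generic quantization step, not how QR-VC processes the residual afterwards.

```python
import numpy as np

def quantize_with_residual(features, codebook):
    """Assign each frame to its nearest centroid and return what is left over.

    features: (T, D) array of frame-level features (hypothetical stand-in
              for self-supervised speech features).
    codebook: (K, D) array of K-means centroids (assumed to be pre-trained).
    """
    # Squared Euclidean distance between every frame and every centroid: (T, K).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)      # nearest centroid per frame
    quantized = codebook[indices]       # (T, D) quantized representation
    residual = features - quantized     # (T, D) detail discarded by quantization
    return indices, quantized, residual

# Random data standing in for real features and a real codebook.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 8))
codebook = rng.normal(size=(4, 8))

indices, quantized, residual = quantize_with_residual(features, codebook)

# The quantized part and the residual together reconstruct the input exactly.
assert np.allclose(quantized + residual, features)
```

By construction, the residual holds everything the centroid assignment throws away, which is why the abstract treats it as a possible source of the fine-grained phonetic and prosodic variations lost in quantization.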

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing