TL;DR
X-VC introduces a zero-shot streaming voice conversion system that operates in the neural codec space, enabling high-quality, low-latency, speaker-independent conversion suitable for interactive applications.
Contribution
The paper proposes a novel codec-space one-step conversion method with dual-conditioning and adaptive normalization, improving zero-shot streaming VC performance.
Findings
Achieves the best (lowest) streaming word error rate (WER) in both English and Chinese
Demonstrates strong speaker similarity in various settings
Achieves a lower real-time factor (RTF) than the baselines
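For readers unfamiliar with the real-time factor metric: it is conventionally the ratio of processing time to audio duration, so values below 1 mean the system runs faster than real time. The sketch below illustrates that standard definition only; the function name is hypothetical and the paper's exact measurement protocol is not given in this summary.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (standard RTF definition).

    Hypothetical helper for illustration; not taken from the X-VC code.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Example: converting 2.0 s of speech in 0.5 s of compute.
rtf = real_time_factor(0.5, 2.0)
```

A streaming system needs an RTF comfortably below 1 so that each chunk is converted before the next one arrives.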
Abstract
Zero-shot voice conversion (VC) aims to convert a source utterance into the voice of an unseen target speaker while preserving its linguistic content. Although recent systems have improved conversion quality, building zero-shot VC systems for interactive scenarios remains challenging because high-fidelity speaker transfer and low-latency streaming inference are difficult to achieve simultaneously. In this work, we present X-VC, a zero-shot streaming VC system that performs one-step conversion in the latent space of a pretrained neural codec. X-VC uses a dual-conditioning acoustic converter that jointly models source codec latents and frame-level acoustic conditions derived from target reference speech, while injecting utterance-level target speaker information through adaptive normalization. To reduce the mismatch between training and inference, we train the model with generated paired…
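The abstract's idea of injecting utterance-level speaker information through adaptive normalization can be pictured with a minimal sketch. It assumes a common formulation in which each frame is layer-normalized and then scaled and shifted by values projected from a speaker embedding. The `(1 + gamma)` scaling, the linear projections, and every name below are assumptions made for illustration, not the paper's confirmed design.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def layer_norm(frame: Sequence[float], eps: float = 1e-5) -> Vector:
    """Normalize one frame to zero mean and unit variance across channels."""
    mean = sum(frame) / len(frame)
    var = sum((x - mean) ** 2 for x in frame) / len(frame)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(x - mean) * inv_std for x in frame]


def linear(weights: Matrix, bias: Vector, x: Sequence[float]) -> Vector:
    """Plain affine projection: weights @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def adaptive_norm(frames: Matrix, speaker_emb: Sequence[float],
                  w_gamma: Matrix, b_gamma: Vector,
                  w_beta: Matrix, b_beta: Vector) -> Matrix:
    """Modulate normalized frames with speaker-dependent scale and shift.

    The same gamma and beta are applied to every frame, which is what
    makes the conditioning utterance-level rather than frame-level.
    """
    gamma = linear(w_gamma, b_gamma, speaker_emb)  # assumed projection
    beta = linear(w_beta, b_beta, speaker_emb)     # assumed projection
    return [[(1.0 + g) * h + b
             for h, g, b in zip(layer_norm(frame), gamma, beta)]
            for frame in frames]


# Toy usage: one 2-channel frame and a 1-dimensional speaker embedding.
out = adaptive_norm(
    frames=[[1.0, 3.0]],
    speaker_emb=[1.0],
    w_gamma=[[0.0], [0.0]], b_gamma=[0.0, 0.0],
    w_beta=[[0.5], [0.5]], b_beta=[0.0, 0.0],
)
```

In this toy call the scale projection is zero, so the output is the normalized frame shifted by the speaker-dependent offset of 0.5 per channel.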
