RT-VC: Real-Time Zero-Shot Voice Conversion with Speech Articulatory Coding
Yisi Liu, Chenyang Wang, Hanjo Kim, Raniya Khan, Gopala Anumanchipalli

TL;DR
RT-VC is a real-time zero-shot voice conversion system that uses articulatory features and differentiable digital signal processing (DDSP) to deliver high-quality, low-latency voice conversion for applications ranging from assistive communication to entertainment.
Contribution
The paper introduces RT-VC, a zero-shot, real-time voice conversion framework utilizing articulatory features and DDSP for efficient, high-quality voice transformation with reduced latency.
Findings
Achieves a CPU latency of 61.4 ms, a 13.3% reduction relative to the current SOTA method.
Maintains synthesis quality comparable to the current SOTA method.
Utilizes articulatory features for robust and interpretable voice conversion.
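To put the latency figures in context, the short calculation below works out what baseline latency they would imply. It assumes the 13.3% reduction is measured relative to the baseline, i.e. reduction = (baseline - latency) / baseline; the summary does not state the baseline value, so the result is a derived estimate rather than a reported number. The function name `implied_baseline_ms` is hypothetical.

```python
def implied_baseline_ms(latency_ms: float, reduction: float) -> float:
    """Baseline latency implied by a relative reduction.

    Assumes reduction = (baseline - latency) / baseline, which is an
    assumption about how the 13.3% figure was computed.
    """
    return latency_ms / (1.0 - reduction)


# Figures taken from the summary: 61.4 ms CPU latency, 13.3% reduction.
baseline = implied_baseline_ms(61.4, 0.133)
saved = baseline - 61.4
print(f"implied baseline: {baseline:.1f} ms, saved: {saved:.1f} ms")
# prints "implied baseline: 70.8 ms, saved: 9.4 ms"
```

Under that assumption, the reported reduction corresponds to roughly 9 ms saved per conversion.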
Abstract
Voice conversion has emerged as a pivotal technology in numerous applications ranging from assistive communication to entertainment. In this paper, we present RT-VC, a zero-shot real-time voice conversion system that delivers ultra-low latency and high-quality performance. Our approach leverages an articulatory feature space to naturally disentangle content and speaker characteristics, facilitating more robust and interpretable voice transformations. Additionally, the integration of differentiable digital signal processing (DDSP) enables efficient vocoding directly from articulatory features, significantly reducing conversion latency. Experimental evaluations demonstrate that, while maintaining synthesis quality comparable to the current state-of-the-art (SOTA) method, RT-VC achieves a CPU latency of 61.4 ms, representing a 13.3% reduction in latency.
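As a rough illustration of the kind of signal-processing block that DDSP-style vocoders build on, the sketch below implements a generic harmonic (additive) synthesizer driven by frame-rate controls. It is not RT-VC's vocoder: the abstract does not describe the architecture, and the names `harmonic_synth`, `f0_hz`, `harm_amps`, `sample_rate`, and `hop` are all hypothetical. It is written in NumPy for readability; a differentiable version would express the same operations in an autodiff framework so that the control parameters could be learned.

```python
import numpy as np


def harmonic_synth(f0_hz, harm_amps, sample_rate=16000, hop=160):
    """Generic additive synthesizer (illustrative, not the paper's model).

    f0_hz:     (n_frames,) fundamental frequency per frame, in Hz
    harm_amps: (n_frames, n_harmonics) amplitude of each harmonic per frame
    Returns a mono waveform of length n_frames * hop.
    """
    f0_hz = np.asarray(f0_hz, dtype=np.float64)
    harm_amps = np.asarray(harm_amps, dtype=np.float64)
    n_frames, n_harm = harm_amps.shape
    n_samples = n_frames * hop

    # Linearly interpolate frame-rate controls up to the audio sample rate.
    frame_pos = np.arange(n_frames) * hop
    sample_pos = np.arange(n_samples)
    f0 = np.interp(sample_pos, frame_pos, f0_hz)
    amps = np.stack(
        [np.interp(sample_pos, frame_pos, harm_amps[:, k]) for k in range(n_harm)],
        axis=1,
    )

    # Integrate instantaneous frequency to get the fundamental's phase.
    phase = 2.0 * np.pi * np.cumsum(f0 / sample_rate)

    # Harmonic k oscillates at k * f0; mute any harmonic above Nyquist.
    k = np.arange(1, n_harm + 1)
    amps = amps * (f0[:, None] * k[None, :] < sample_rate / 2.0)
    return np.sum(amps * np.sin(phase[:, None] * k[None, :]), axis=1)


# Usage: half a second of a steady 220 Hz tone with three harmonics.
audio = harmonic_synth(np.full(50, 220.0), np.tile([0.5, 0.25, 0.125], (50, 1)))
```

Because every step is interpolation, cumulative summation, and elementwise arithmetic, a synthesizer of this kind is cheap to run on a CPU, which is consistent with the efficiency motivation stated in the abstract.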
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders
