StreamVC: Real-Time Low-Latency Voice Conversion
Yang Yang, Yury Kartynnik, Yunpeng Li, Jiuqiang Tang, Xing Li, George Sung, Matthias Grundmann

TL;DR
StreamVC is a real-time, low-latency voice conversion system that maintains speech content and prosody while matching target voice timbre, suitable for mobile and live communication applications.
Contribution
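StreamVC introduces a streaming voice conversion method that builds on the architecture and training strategy of the SoundStream neural audio codec for lightweight, high-quality synthesis at low latency on mobile devices, enabling real-time uses such as voice anonymization in calls and video conferencing.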
Findings
Achieves low-latency waveform generation on mobile platforms.
Effectively preserves prosody and content during conversion.
Supplying whitened fundamental frequency information improves pitch stability without leaking source timbre.
Abstract
We present StreamVC, a streaming voice conversion solution that preserves the content and prosody of any source speech while matching the voice timbre from any target speech. Unlike previous approaches, StreamVC produces the resulting waveform at low latency from the input signal even on a mobile platform, making it applicable to real-time communication scenarios like calls and video conferencing, and addressing use cases such as voice anonymization in these scenarios. Our design leverages the architecture and training strategy of the SoundStream neural audio codec for lightweight high-quality speech synthesis. We demonstrate the feasibility of learning soft speech units causally, as well as the effectiveness of supplying whitened fundamental frequency information to improve pitch stability without leaking the source timbre information.
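The abstract does not spell out how the fundamental frequency (f0) is whitened. A minimal sketch, assuming "whitened" means normalizing the f0 contour to zero mean and unit variance over the voiced frames of an utterance, is shown below. The function name `whiten_f0`, the convention that unvoiced frames carry `f0 <= 0`, and the example contours are all hypothetical and chosen only for illustration. The point of the sketch is that two speakers with the same pitch contour shape but different pitch ranges map to nearly the same normalized contour, so the absolute pitch level (a speaker cue) is removed while the relative pitch movement is kept.

```python
import numpy as np


def whiten_f0(f0_hz, eps=1e-8):
    """Normalize an f0 contour to zero mean, unit variance over voiced frames.

    Assumption for this sketch: frames with f0 <= 0 are unvoiced and are
    left at 0 in the output.
    """
    f0 = np.asarray(f0_hz, dtype=np.float64)
    voiced = f0 > 0
    out = np.zeros_like(f0)
    if not voiced.any():
        return out  # nothing to normalize in a fully unvoiced utterance
    mean = f0[voiced].mean()
    std = f0[voiced].std()
    out[voiced] = (f0[voiced] - mean) / (std + eps)
    return out


# Hypothetical contours: same shape, one an octave above the other.
low = np.array([0.0, 100.0, 110.0, 120.0, 0.0, 105.0])
high = low * 2.0  # unvoiced frames stay at 0

whitened_low = whiten_f0(low)
whitened_high = whiten_f0(high)
```

In a streaming setting the utterance-level statistics are not available up front, so a real-time variant would need causal estimates of the mean and variance (for example running averages); that variant is not shown here.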
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
