Low-latency Real-time Voice Conversion on CPU
Konstantine Sadov, Matthew Hutter, Asara Near

TL;DR
This paper introduces LLVC, a low-latency, resource-efficient neural network for real-time any-to-one voice conversion. It runs on a consumer CPU with under 20ms latency, and the authors report, to their knowledge, the lowest latency and resource usage of any open-source voice conversion model.
Contribution
The paper presents LLVC, a neural network architecture for real-time any-to-one voice conversion, adapted from earlier audio manipulation and generation networks and trained with a generative adversarial setup and knowledge distillation.
Findings
Latency under 20ms at a 16kHz sample rate
Runs nearly 2.8x faster than real-time on a consumer CPU
Lowest resource usage among open-source voice conversion models, to the authors' knowledge
Abstract
We adapt the architectures of previous audio manipulation and generation neural networks to the task of real-time any-to-one voice conversion. Our resulting model, LLVC (Low-latency Low-resource Voice Conversion), has a latency of under 20ms at a sample rate of 16kHz and runs nearly 2.8x faster than real-time on a consumer CPU. LLVC uses both a generative adversarial architecture and knowledge distillation to attain this performance. To our knowledge, LLVC achieves both the lowest resource usage and the lowest latency of any open-source voice conversion model. We provide open-source samples, code, and pretrained model weights at https://github.com/KoeAI/LLVC.
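To make the headline numbers concrete, the short sketch below works through the arithmetic behind "under 20ms at 16kHz" and "nearly 2.8x faster than real-time". It is illustrative only: the function names are hypothetical, and the 20ms chunk used in the example is an assumption chosen to match the latency figure, not a chunk size stated in the abstract. The reported latency may also include components other than compute time.

```python
def samples_in_window(sample_rate_hz: int, window_ms: float) -> int:
    """Number of audio samples spanned by a window of window_ms milliseconds."""
    return round(sample_rate_hz * window_ms / 1000)


def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio handled per second of compute; above 1 is faster than real time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def compute_time_ms(chunk_ms: float, speedup: float) -> float:
    """Compute time needed for one chunk when running `speedup` times faster than real time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return chunk_ms / speedup


if __name__ == "__main__":
    # A 20 ms window at a 16 kHz sample rate.
    print(samples_in_window(16_000, 20))
    # Assumed example: 2.8 s of audio converted in 1.0 s of compute.
    print(real_time_factor(2.8, 1.0))
    # Assumed 20 ms chunk processed at 2.8x real-time speed.
    print(round(compute_time_ms(20, 2.8), 2))
```

Under these assumptions, a 20ms window holds 320 samples, and a model running 2.8x faster than real time would spend roughly 7ms of compute on each 20ms chunk, leaving headroom on the CPU for other work.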
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Knowledge Distillation
