ConVoice: Real-Time Zero-Shot Voice Style Transfer with Convolutional Network
Yurii Rebryk, Stanislav Beliaev

TL;DR
ConVoice is a fast convolutional neural network that performs real-time zero-shot voice style transfer without parallel or transcribed data, using pre-trained ASR and speaker embedding models.
Contribution
ConVoice is a zero-shot voice conversion network that is fully convolutional and non-autoregressive, apart from a small pre-trained recurrent speaker encoder, and that keeps quality comparable to state-of-the-art models while running fast.
Findings
Achieves quality comparable to similar state-of-the-art models
Runs in real time
Converts speech of any length without quality loss
Abstract
We propose a neural network for zero-shot voice conversion (VC) without any parallel or transcribed data. Our approach uses pre-trained models for automatic speech recognition (ASR) and speaker embedding, obtained from a speaker verification task. Our model is fully convolutional and non-autoregressive except for a small pre-trained recurrent neural network for speaker encoding. ConVoice can convert speech of any length without compromising quality due to its convolutional architecture. Our model has comparable quality to similar state-of-the-art models while being extremely fast.
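The abstract attributes the any-length property to the convolutional architecture. A minimal sketch of why that holds, assuming a single 1-D convolution with zero "same" padding and a speaker vector appended to every content frame: the same weights slide over the time axis, so the output has as many frames as the input regardless of duration. This is not the ConVoice architecture; the function names (`conv1d_same`, `convert`), the layer sizes, and the concatenation-based conditioning are hypothetical choices made for illustration only.

```python
import random


def conv1d_same(frames, kernels, bias):
    """1-D convolution over time with zero 'same' padding.

    frames:  list of T frames, each a list of C_in floats
    kernels: list of C_out kernels; each kernel is K taps of C_in floats
    bias:    list of C_out floats
    Returns a list of T frames, each with C_out floats.
    """
    k = len(kernels[0])
    pad = k // 2  # with odd K, output length equals input length
    zero = [0.0] * len(frames[0])
    padded = [zero] * pad + frames + [zero] * pad
    out = []
    for t in range(len(frames)):
        window = padded[t:t + k]
        out.append([
            b + sum(w * x
                    for tap, frame in zip(kernel, window)
                    for w, x in zip(tap, frame))
            for kernel, b in zip(kernels, bias)
        ])
    return out


def convert(content_frames, speaker_embedding, kernels, bias):
    """Append one speaker vector to every content frame, then convolve.

    Hypothetical conditioning scheme, chosen only to keep the sketch short.
    """
    conditioned = [frame + speaker_embedding for frame in content_frames]
    return conv1d_same(conditioned, kernels, bias)


# Illustrative sizes (assumptions, not values from the paper).
C_CONTENT, C_SPK, C_OUT, K = 4, 3, 5, 3
rng = random.Random(0)
kernels = [[[rng.uniform(-1, 1) for _ in range(C_CONTENT + C_SPK)]
            for _ in range(K)]
           for _ in range(C_OUT)]
bias = [0.0] * C_OUT
speaker = [rng.uniform(-1, 1) for _ in range(C_SPK)]

# The same weights handle inputs of any number of frames.
for T in (1, 10, 137):
    content = [[rng.uniform(-1, 1) for _ in range(C_CONTENT)]
               for _ in range(T)]
    converted = convert(content, speaker, kernels, bias)
    assert len(converted) == T and len(converted[0]) == C_OUT
```

Because every output frame depends only on a fixed local window of input frames, all frames can be computed independently, which is also what makes a non-autoregressive model of this kind easy to parallelize.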
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
