ConVoice: Real-Time Zero-Shot Voice Style Transfer with Convolutional Network

Yurii Rebryk; Stanislav Beliaev

arXiv:2005.07815 · eess.AS · May 19, 2020 · 5 cites

Open Access

TL;DR

ConVoice is a fast convolutional neural network that performs real-time zero-shot voice style transfer without parallel or transcribed data, relying on pre-trained ASR and speaker embedding models.

Contribution

It introduces a fully convolutional, non-autoregressive neural network for zero-shot voice conversion that maintains high quality and speed.

Findings

01 Achieves quality comparable to similar state-of-the-art models

02 Operates in real time at high speed

03 Handles speech of any length without quality loss

Abstract

We propose a neural network for zero-shot voice conversion (VC) without any parallel or transcribed data. Our approach uses pre-trained models for automatic speech recognition (ASR) and speaker embedding, obtained from a speaker verification task. Our model is fully convolutional and non-autoregressive except for a small pre-trained recurrent neural network for speaker encoding. ConVoice can convert speech of any length without compromising quality due to its convolutional architecture. Our model has comparable quality to similar state-of-the-art models while being extremely fast.
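The claim that a convolutional model can convert speech of any length can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`conv1d_same`, `convert`, `random_kernel`), the feature sizes, and the choice to concatenate the speaker embedding onto every content frame are all assumptions made for illustration, and the random weights stand in for a trained model. The sketch only shows that the same convolution weights apply to inputs of any number of frames, and that each output frame depends on a fixed local window of the input.

```python
import random


def random_kernel(k, c_out, c_in, seed=0):
    """Hypothetical stand-in for trained weights: k taps, each c_out x c_in."""
    rng = random.Random(seed)
    return [[[rng.uniform(-1.0, 1.0) for _ in range(c_in)]
             for _ in range(c_out)]
            for _ in range(k)]


def conv1d_same(frames, kernel):
    """1D convolution over time with zero 'same' padding.

    frames: list of T feature vectors, each of length c_in.
    kernel: list of k taps (k odd), each a c_out x c_in matrix.
    Returns a list of T output vectors, each of length c_out.
    """
    k = len(kernel)
    assert k % 2 == 1, "kernel size must be odd for symmetric padding"
    pad = k // 2
    c_in = len(frames[0])
    c_out = len(kernel[0])
    zero = [0.0] * c_in
    padded = [zero] * pad + frames + [zero] * pad
    out = []
    for t in range(len(frames)):
        y = [0.0] * c_out
        for j in range(k):
            x = padded[t + j]
            w = kernel[j]
            for o in range(c_out):
                y[o] += sum(w[o][i] * x[i] for i in range(c_in))
        out.append(y)
    return out


def convert(content, speaker_embedding, kernel):
    """Assumed conditioning scheme: append the fixed-size speaker
    embedding to every content frame, then convolve over time."""
    conditioned = [frame + speaker_embedding for frame in content]
    return conv1d_same(conditioned, kernel)


if __name__ == "__main__":
    kernel = random_kernel(k=5, c_out=4, c_in=3 + 2, seed=1)
    speaker = [0.5, -0.25]  # illustrative 2-dim speaker embedding
    for n_frames in (10, 100, 1000):
        content = [[float(t), 1.0, -1.0] for t in range(n_frames)]
        converted = convert(content, speaker, kernel)
        print(n_frames, len(converted), len(converted[0]))
```

Because the weights are shared across time and no state is carried from one frame to the next, the number of output frames always equals the number of input frames, and a change to one input frame affects only the outputs inside the kernel's window around it.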

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing