Self-supervised learning for robust voice cloning
Konstantinos Klapsas, Nikolaos Ellinas, Karolos Nikitaras, Georgios Vamvoukakis, Panos Kakoulidis, Konstantinos Markopoulos, Spyros Raptis, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis

TL;DR
This paper introduces a self-supervised learning approach using BYOL to extract robust speech features for high-quality, multispeaker voice cloning that works effectively even with unseen speakers and noisy conditions.
Contribution
It applies self-supervised learning with novel audio augmentations to improve speaker identity capture and robustness in voice cloning without needing labeled speaker data.
Findings
Achieves high-quality voice cloning for unseen speakers.
Learned features are robust to noise and varying acoustic conditions.
Trains effectively on unlabeled multispeaker datasets.
Abstract
Voice cloning is a difficult task which requires robust and informative features incorporated in a high quality TTS system in order to effectively copy an unseen speaker's voice. In our work, we utilize features learned in a self-supervised framework via the Bootstrap Your Own Latent (BYOL) method, which is shown to produce high quality speech representations when specific audio augmentations are applied to the vanilla algorithm. We further extend the augmentations in the training procedure to aid the resulting features to capture the speaker identity and to make them robust to noise and acoustic conditions. The learned features are used as pre-trained utterance-level embeddings and as inputs to a Non-Attentive Tacotron based architecture, aiming to achieve multispeaker speech synthesis without utilizing additional speaker features. This method enables us to train our model in an…
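To make the BYOL objective mentioned in the abstract more concrete, the following is a minimal NumPy sketch of its two core pieces: the normalized regression loss between an online network's prediction and a target network's projection, and the exponential moving average (EMA) update of the target weights. It shows only one direction of the loss, whereas BYOL is usually trained with the symmetrized version over both augmented views. The function names, the decay value `tau=0.99`, and the additive-noise augmentation are assumptions chosen for illustration; the abstract does not specify the exact augmentations or hyperparameters used by the authors.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def byol_loss(online_prediction, target_projection):
    # BYOL regression loss: 2 - 2 * cosine similarity, averaged over the batch.
    # In real training, no gradient flows through target_projection.
    p = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    return float(np.mean(2.0 - 2.0 * np.sum(p * z, axis=-1)))

def ema_update(target_params, online_params, tau=0.99):
    # Target weights slowly track the online weights (tau is an assumed value).
    return {name: tau * target_params[name] + (1.0 - tau) * online_params[name]
            for name in target_params}

def add_noise_at_snr(waveform, snr_db, rng):
    # Hypothetical augmentation: additive Gaussian noise at a chosen SNR in dB.
    signal_power = np.mean(waveform ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    return waveform + noise

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = np.sin(np.linspace(0.0, 100.0, 16000))
    view_a = add_noise_at_snr(clean, snr_db=20.0, rng=rng)
    view_b = add_noise_at_snr(clean, snr_db=5.0, rng=rng)
    # Stand-ins for encoder outputs: here just the first 8 samples of each view.
    print(byol_loss(view_a[None, :8], view_b[None, :8]))
```

The loss is 0 when the prediction and the target point in the same direction and 4 when they point in opposite directions, so minimizing it pulls the embeddings of two augmented views of the same utterance together without needing speaker labels.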
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Sigmoid Activation · Convolution · Highway Layer · Highway Network · Bidirectional GRU · Batch Normalization · Max Pooling · CBHG · Gated Recurrent Unit
