Investigating self-supervised features for expressive, multilingual voice conversion

Álvaro Martín-Cortinas; Daniel Sáez-Trigueros; Grzegorz Beringer; Iván Vallés-Pérez; Roberto Barra-Chicote; Biel Tura-Vecino; Adam Gabryś; Piotr Bilinski; Thomas Merritt; Jaime Lorenzo-Trueba

arXiv:2505.08278 · eess.AS · May 14, 2025 · ICASSP

TL;DR

This paper explores self-supervised learning (SSL) representations for voice conversion. SSL features concatenated with speaker embeddings are fed to a vocoder trained to reconstruct its input, giving zero-shot conversion that preserves the source speaker's prosody and content without requiring parallel training data.

Contribution

The paper combines latent representations from SSL models with speaker embeddings as the input to a reconstruction-trained vocoder, and evaluates the resulting system on zero-shot voice conversion with a focus on prosody and content preservation.

Findings

01. Zero-shot conversion maintains prosody and content.

02. Comparable speaker similarity to PPG-based systems.

03. No need for parallel training data.

Abstract

Voice conversion (VC) systems are widely used for several applications, from speaker anonymisation to personalised speech synthesis. Supervised approaches learn a mapping between different speakers using parallel data, which is expensive to produce. Unsupervised approaches are typically trained to reconstruct the input signal, which is composed of the content and the speaker information. Disentangling these components is a challenge and often leads to speaker leakage or prosodic information removal. In this paper, we explore voice conversion by leveraging the potential of self-supervised learning (SSL). A combination of the latent representations of SSL models, concatenated with speaker embeddings, is fed to a vocoder which is trained to reconstruct the input. Zero-shot voice conversion results show that this approach keeps the prosody and content of the source speaker while…
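
The conditioning step described in the abstract (frame-level SSL features concatenated with a speaker embedding before the vocoder) can be illustrated with a small sketch. This is not the authors' implementation: the function name `build_vocoder_input`, the feature sizes (768 and 192), and the random arrays standing in for real SSL features and speaker embeddings are all assumptions made for illustration. The idea of swapping in a target speaker's embedding at inference is the usual way such reconstruction-trained systems are used for conversion and is likewise assumed here, not taken from the paper.

```python
import numpy as np


def build_vocoder_input(ssl_features, speaker_embedding):
    """Concatenate frame-level SSL features with one utterance-level speaker embedding.

    ssl_features: array of shape (num_frames, ssl_dim)
    speaker_embedding: array of shape (spk_dim,)
    Returns an array of shape (num_frames, ssl_dim + spk_dim).
    """
    ssl_features = np.asarray(ssl_features, dtype=np.float32)
    speaker_embedding = np.asarray(speaker_embedding, dtype=np.float32)
    if ssl_features.ndim != 2:
        raise ValueError("ssl_features must have shape (num_frames, ssl_dim)")
    if speaker_embedding.ndim != 1:
        raise ValueError("speaker_embedding must have shape (spk_dim,)")

    num_frames = ssl_features.shape[0]
    # Repeat the same speaker vector for every frame so it can be concatenated.
    tiled = np.broadcast_to(speaker_embedding, (num_frames, speaker_embedding.shape[0]))
    return np.concatenate([ssl_features, tiled], axis=-1)


# Placeholder data: sizes are hypothetical, values are random stand-ins.
rng = np.random.default_rng(0)
source_ssl = rng.standard_normal((50, 768)).astype(np.float32)  # 50 frames of SSL features
source_spk = rng.standard_normal(192).astype(np.float32)        # source speaker embedding
target_spk = rng.standard_normal(192).astype(np.float32)        # target speaker embedding

# Training-style input: source content paired with the source speaker (reconstruction).
train_input = build_vocoder_input(source_ssl, source_spk)

# Conversion-style input (assumed): same content features, target speaker embedding.
convert_input = build_vocoder_input(source_ssl, target_spk)
```

In this sketch only the speaker columns differ between `train_input` and `convert_input`; the content and prosody information carried by the SSL features is passed through unchanged, which is the property the abstract attributes to the approach.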
