GlowVC: Mel-spectrogram space disentangling model for language-independent text-free voice conversion

Magdalena Proszewska; Grzegorz Beringer; Daniel Sáez-Trigueros; Thomas Merritt; Abdelhamid Ezzerg; Roberto Barra-Chicote

arXiv:2207.01454 · eess.AS · July 5, 2022

Open Access

TL;DR

GlowVC is a flow-based, language-independent voice conversion model that disentangles content, pitch, and speaker features in mel-spectrograms. The reported findings are better intelligibility than AutoVC and, for the GlowVC-explicit variant, the best naturalness among the compared models.

Contribution

The paper presents GlowVC, a flow-based model for multilingual, text-free voice conversion. It disentangles content, pitch, and speaker information in the mel-spectrogram space and, building on Glow-TTS, uses linguistic features during training without requiring them at inference.

Findings

01. GlowVC outperforms AutoVC in intelligibility.

02. GlowVC achieves high speaker similarity in intra-lingual conversion.

03. GlowVC-explicit surpasses other models in naturalness.

Abstract

In this paper, we propose GlowVC: a multilingual multi-speaker flow-based model for language-independent text-free voice conversion. We build on Glow-TTS, which provides an architecture that enables use of linguistic features during training without the necessity of using them for VC inference. We consider two versions of our model: GlowVC-conditional and GlowVC-explicit. GlowVC-conditional models the distribution of mel-spectrograms with speaker-conditioned flow and disentangles the mel-spectrogram space into content- and pitch-relevant dimensions, while GlowVC-explicit models the explicit distribution with unconditioned flow and disentangles said space into content-, pitch- and speaker-relevant dimensions. We evaluate our models in terms of intelligibility, speaker similarity and naturalness for intra- and cross-lingual conversion in seen and unseen languages. GlowVC models greatly…

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques

Methods: Normalizing Flows · Affine Coupling · Activation Normalization · Invertible 1x1 Convolution · GLOW · Glow-TTS
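
Affine coupling, one of the methods listed above, is the layer type that keeps Glow-style flows exactly invertible. The following is a minimal sketch in plain Python, not GlowVC's actual architecture: `toy_net` is a hypothetical stand-in for the learned network, and the function names and the 4-dimensional input are assumptions chosen for illustration. It shows that the inverse needs no matrix inversion and that the log-determinant is just the sum of the predicted log-scales.

```python
import math


def toy_net(x_a):
    """Hypothetical stand-in for a learned network (an assumption, not from the paper).

    Maps the untouched half of the input to one (log_scale, shift) pair
    per dimension of the transformed half.
    """
    s = sum(x_a)
    log_scale = [0.1 * math.tanh(s + i) for i in range(len(x_a))]
    shift = [0.5 * math.sin(s - i) for i in range(len(x_a))]
    return log_scale, shift


def coupling_forward(x):
    """Affine coupling: copy the first half, scale and shift the second half."""
    if len(x) % 2 != 0:
        raise ValueError("this sketch expects an even number of dimensions")
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_scale, shift = toy_net(x_a)
    y_b = [b * math.exp(ls) + t for b, ls, t in zip(x_b, log_scale, shift)]
    # The Jacobian is triangular, so its log-determinant is the sum of log-scales.
    log_det = sum(log_scale)
    return x_a + y_b, log_det


def coupling_inverse(y):
    """Exact inverse: the first half is unchanged, so toy_net can be re-evaluated."""
    if len(y) % 2 != 0:
        raise ValueError("this sketch expects an even number of dimensions")
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_scale, shift = toy_net(y_a)
    x_b = [(b - t) * math.exp(-ls) for b, ls, t in zip(y_b, log_scale, shift)]
    return y_a + x_b


if __name__ == "__main__":
    x = [0.3, -1.2, 0.7, 2.0]
    y, log_det = coupling_forward(x)
    x_rec = coupling_inverse(y)
    print("max reconstruction error:", max(abs(a - b) for a, b in zip(x, x_rec)))
```

Because the first half passes through unchanged, the inverse can recompute the same scale and shift from it. This is why the inner network itself does not have to be invertible.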