GlowVC: Mel-spectrogram space disentangling model for language-independent text-free voice conversion
Magdalena Proszewska, Grzegorz Beringer, Daniel Sáez-Trigueros, Thomas Merritt, Abdelhamid Ezzerg, Roberto Barra-Chicote

TL;DR
GlowVC introduces a flow-based, language-independent voice conversion model that disentangles content, pitch, and speaker features in mel-spectrograms, achieving superior intelligibility and naturalness across multiple languages.
Contribution
The paper presents GlowVC, a novel flow-based model for multilingual, text-free voice conversion that effectively disentangles speech features without relying on linguistic inputs during inference.
Findings
GlowVC outperforms AutoVC in intelligibility.
GlowVC achieves high speaker similarity in intra-lingual conversion.
GlowVC-explicit surpasses other models in naturalness.
Abstract
In this paper, we propose GlowVC: a multilingual multi-speaker flow-based model for language-independent text-free voice conversion. We build on Glow-TTS, which provides an architecture that enables use of linguistic features during training without the necessity of using them for VC inference. We consider two versions of our model: GlowVC-conditional and GlowVC-explicit. GlowVC-conditional models the distribution of mel-spectrograms with speaker-conditioned flow and disentangles the mel-spectrogram space into content- and pitch-relevant dimensions, while GlowVC-explicit models the explicit distribution with unconditioned flow and disentangles said space into content-, pitch- and speaker-relevant dimensions. We evaluate our models in terms of intelligibility, speaker similarity and naturalness for intra- and cross-lingual conversion in seen and unseen languages. GlowVC models greatly…
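The abstract relies on a property of normalizing flows: the mapping between a mel-spectrogram frame and its latent vector is exactly invertible, so selected latent dimensions can be edited and mapped back to a frame. The sketch below illustrates that property with a single toy affine coupling layer in plain Python. It is not the authors' implementation. The function names, the fixed stand-in for the learned scale/shift network, the 4-dimensional "frame", and the choice of which latent dimension counts as "speaker-relevant" are all assumptions made for illustration only.

```python
import math

def _scale_shift(xa):
    # Hypothetical stand-in for the learned network that predicts a
    # log-scale and a shift from the half of the input left untouched.
    log_s = [math.tanh(0.5 * v) for v in xa]
    t = [0.1 * v + 0.2 for v in xa]
    return log_s, t

def coupling_forward(x):
    """Map a frame x to a latent z; also return log|det Jacobian|."""
    half = len(x) // 2
    xa, xb = x[:half], x[half:]
    log_s, t = _scale_shift(xa)
    zb = [b * math.exp(s) + ti for b, s, ti in zip(xb, log_s, t)]
    return xa + zb, sum(log_s)

def coupling_inverse(z):
    """Exactly undo coupling_forward."""
    half = len(z) // 2
    za, zb = z[:half], z[half:]
    log_s, t = _scale_shift(za)
    xb = [(b - ti) * math.exp(-s) for b, s, ti in zip(zb, log_s, t)]
    return za + xb

# Round trip: the inverse recovers the original toy frame.
frame = [0.3, -1.2, 0.8, 2.0]
z, log_det = coupling_forward(frame)
recovered = coupling_inverse(z)

# Illustrative latent edit (assumption: the last latent dimension is the
# "speaker-relevant" one). Keep the source's other dimensions, take the
# target's speaker dimension, and map back to frame space.
src = [0.3, -1.2, 0.8, 2.0]
tgt = [-0.5, 0.4, 1.5, -0.7]
z_src, _ = coupling_forward(src)
z_tgt, _ = coupling_forward(tgt)
z_conv = z_src[:3] + z_tgt[3:]
converted = coupling_inverse(z_conv)
```

A real flow stacks many such layers, together with activation normalization and invertible 1x1 convolutions, and learns the scale/shift networks from data. How GlowVC actually assigns latent dimensions to content, pitch and speaker is described in the paper, not in this sketch.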
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
Methods: Normalizing Flows · Affine Coupling · Activation Normalization · Invertible 1x1 Convolution · GLOW · Glow-TTS
