Text-free non-parallel many-to-many voice conversion using normalising flows
Thomas Merritt, Abdelhamid Ezzerg, Piotr Biliński, Magdalena Proszewska, Kamil Pokora, Roberto Barra-Chicote, Daniel Korzekwa

TL;DR
This paper explores the use of normalising flows for non-parallel, text-free voice conversion, demonstrating lossless speech encoding and improved performance over existing methods, with insights on prior training strategies.
Contribution
It introduces a novel application of normalising flows for voice conversion, comparing text-conditioned and text-free scenarios, and analyzing the impact of prior training methods.
Findings
Normalising flows enable lossless speech encoding for VC.
Text-free VC shows no performance degradation relative to text-conditioned VC.
Joint training of priors negatively affects text-free VC quality.
Abstract
Non-parallel voice conversion (VC) is typically achieved using lossy representations of the source speech. However, ensuring only speaker identity information is dropped whilst all other information from the source speech is retained is a large challenge. This is particularly challenging in the scenario where at inference-time we have no knowledge of the text being read, i.e., text-free VC. To mitigate this, we investigate information-preserving VC approaches. Normalising flows have gained attention for text-to-speech synthesis, however have been under-explored for VC. Flows utilize invertible functions to learn the likelihood of the data, thus provide a lossless encoding of speech. We investigate normalising flows for VC in both text-conditioned and text-free scenarios. Furthermore, for text-free VC we compare pre-trained and jointly-learnt priors. Flow-based VC evaluations show no…
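To make the "invertible functions, lossless encoding" idea in the abstract concrete, here is a minimal sketch of a single speaker-conditioned affine coupling layer in plain Python. It is an illustration under stated assumptions, not the architecture used in the paper: the toy linear conditioner, the 4-dimensional "frame", the one-hot speaker vectors and all function names (`scale_and_shift`, `forward`, `inverse`) are hypothetical. It shows two things: decoding with the same speaker recovers the input exactly (up to floating-point error), and decoding the same latent with a different speaker vector yields a different output, which is the general mechanism flow-based conversion relies on.

```python
import math
import random


def scale_and_shift(x_a, speaker, weights):
    """Toy conditioner (an assumption): a linear map from the untouched half
    of the frame plus the speaker vector to a log-scale and a shift."""
    inputs = x_a + speaker
    log_scale, shift = [], []
    for w_s, w_t in weights:
        # tanh keeps the log-scale bounded so exp() stays well behaved
        log_scale.append(math.tanh(sum(w * v for w, v in zip(w_s, inputs))))
        shift.append(sum(w * v for w, v in zip(w_t, inputs)))
    return log_scale, shift


def forward(x, speaker, weights):
    """Encode x -> z. Returns z and log|det J| of the transform."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_scale, shift = scale_and_shift(x_a, speaker, weights)
    z_b = [xb * math.exp(s) + t for xb, s, t in zip(x_b, log_scale, shift)]
    # The Jacobian is triangular, so its log-determinant is the sum of log-scales
    return x_a + z_b, sum(log_scale)


def inverse(z, speaker, weights):
    """Decode z -> x by exactly undoing the affine transform."""
    half = len(z) // 2
    z_a, z_b = z[:half], z[half:]
    log_scale, shift = scale_and_shift(z_a, speaker, weights)
    x_b = [(zb - t) * math.exp(-s) for zb, s, t in zip(z_b, log_scale, shift)]
    return z_a + x_b


# Hypothetical toy setup: random conditioner weights, one 4-dim "frame",
# and one-hot vectors standing in for speaker embeddings.
rng = random.Random(0)
weights = [
    ([rng.uniform(-1, 1) for _ in range(4)], [rng.uniform(-1, 1) for _ in range(4)])
    for _ in range(2)
]
frame = [0.3, -1.2, 0.8, 0.5]
source_spk = [1.0, 0.0]
target_spk = [0.0, 1.0]

z, log_det = forward(frame, source_spk, weights)
reconstructed = inverse(z, source_spk, weights)  # same speaker: lossless round trip
converted = inverse(z, target_spk, weights)      # different speaker: changed output
```

The log-determinant returned by `forward` is the term that enters the change-of-variables likelihood, log p(x) = log p(z) + log|det J|, which is what a flow maximises during training. A real model would stack many such layers, alternate which half is transformed, and use a learned neural conditioner rather than the fixed random weights assumed here.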
