Training-Free Voice Conversion with Factorized Optimal Transport
Alexander Lobashev, Assel Yermekova, Maria Larchenko

TL;DR
This paper presents a training-free voice conversion method called Factorized MKL-VC that uses optimal transport in WavLM embeddings to achieve high-quality, cross-lingual voice conversion with only 5 seconds of reference audio, outperforming previous kNN-based methods.
Contribution
The paper introduces a novel training-free voice conversion approach using factorized optimal transport, improving quality and robustness over existing kNN-VC methods.
Findings
Significantly improves content preservation and robustness over kNN-VC when the reference audio is short.
Achieves performance comparable to FACodec, especially in cross-lingual voice conversion.
Effective with only 5 seconds of reference audio.
Abstract
This paper introduces Factorized MKL-VC, a training-free modification of the kNN-VC pipeline. In contrast with the original pipeline, our algorithm performs high-quality any-to-any cross-lingual voice conversion with only 5 seconds of reference audio. MKL-VC replaces kNN regression with a factorized optimal transport map in WavLM embedding subspaces, derived from the Monge-Kantorovich Linear solution. Factorization addresses non-uniform variance across dimensions, ensuring effective feature transformation. Experiments on the LibriSpeech and FLEURS datasets show that MKL-VC significantly improves content preservation and robustness with short reference audio, outperforming kNN-VC. MKL-VC achieves performance comparable to FACodec, especially in the cross-lingual voice conversion domain.
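The core idea in the abstract, a per-subspace linear optimal transport map in place of kNN regression, can be illustrated with a minimal sketch. It uses the well-known closed form of the Monge map between two Gaussians, T(x) = mu_t + A (x - mu_s) with A = S^{-1/2} (S^{1/2} T S^{1/2})^{1/2} S^{-1/2}, where S and T are the source and target covariances. Everything else is an assumption made for illustration and is not taken from the paper: the function names `mkl_map` and `factorized_mkl_convert`, the `block_size` and `eps` parameters, and in particular the choice to factorize into contiguous, equal-sized blocks of dimensions. WavLM feature extraction and vocoding are omitted; the inputs are plain arrays of frame embeddings.

```python
import numpy as np


def _sqrtm_psd(m):
    """Square root of a symmetric positive semi-definite matrix via eigh."""
    w, v = np.linalg.eigh(m)
    w = np.clip(w, 0.0, None)  # guard against tiny negative eigenvalues
    return (v * np.sqrt(w)) @ v.T


def mkl_map(src, tgt, eps=1e-6):
    """Fit the linear Gaussian OT map x -> mu_t + A (x - mu_s).

    src, tgt: arrays of shape (n_frames, d). Returns (A, mu_s, mu_t).
    eps is a small ridge term (an assumption, added for numerical stability).
    """
    d = src.shape[1]
    mu_s, mu_t = src.mean(axis=0), tgt.mean(axis=0)
    cov_s = np.cov(src, rowvar=False).reshape(d, d) + eps * np.eye(d)
    cov_t = np.cov(tgt, rowvar=False).reshape(d, d) + eps * np.eye(d)
    s_half = _sqrtm_psd(cov_s)
    s_inv_half = np.linalg.inv(s_half)
    middle = _sqrtm_psd(s_half @ cov_t @ s_half)
    A = s_inv_half @ middle @ s_inv_half  # symmetric by construction
    return A, mu_s, mu_t


def factorized_mkl_convert(src, tgt, block_size=2):
    """Apply an independent MKL map in each block of dimensions.

    Assumption: blocks are contiguous and equally sized; the paper's actual
    factorization scheme may differ.
    """
    out = np.empty_like(src, dtype=float)
    for start in range(0, src.shape[1], block_size):
        sl = slice(start, start + block_size)
        A, mu_s, mu_t = mkl_map(src[:, sl], tgt[:, sl])
        out[:, sl] = (src[:, sl] - mu_s) @ A.T + mu_t
    return out
```

In this sketch, `src` would hold the source utterance's frame embeddings and `tgt` the embeddings from the short reference clip. Fitting small blocks rather than one full covariance means far fewer parameters are estimated from the few seconds of reference audio, which is one plausible reading of why factorization helps with short references.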
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
