Training-Free Voice Conversion with Factorized Optimal Transport

Alexander Lobashev; Assel Yermekova; Maria Larchenko

arXiv:2506.09709 · cs.SD · June 12, 2025


TL;DR

This paper presents a training-free voice conversion method called Factorized MKL-VC that uses optimal transport in WavLM embeddings to achieve high-quality, cross-lingual voice conversion with only 5 seconds of reference audio, outperforming previous kNN-based methods.

Contribution

The paper introduces a novel training-free voice conversion approach using factorized optimal transport, improving quality and robustness over existing kNN-VC methods.

Findings

01

Significantly improves content preservation and robustness over kNN-VC when the reference audio is short.

02

Achieves performance comparable to FACodec, especially in cross-lingual voice conversion.

03

Effective with only 5 seconds of reference audio.

Abstract

This paper introduces Factorized MKL-VC, a training-free modification of the kNN-VC pipeline. In contrast with the original pipeline, our algorithm performs high-quality any-to-any cross-lingual voice conversion with only 5 seconds of reference audio. MKL-VC replaces kNN regression with a factorized optimal transport map in WavLM embedding subspaces, derived from the Monge-Kantorovich linear solution. Factorization addresses non-uniform variance across dimensions, ensuring effective feature transformation. Experiments on the LibriSpeech and FLEURS datasets show that MKL-VC significantly improves content preservation and robustness with short reference audio, outperforming kNN-VC. MKL-VC achieves performance comparable to FACodec, especially in the cross-lingual voice conversion domain.
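To make the idea of a factorized Monge-Kantorovich linear map more concrete, here is a minimal NumPy sketch. It uses the standard closed-form optimal transport map between two Gaussians, T(x) = m_t + A (x - m_s) with A = S_s^{-1/2} (S_s^{1/2} S_t S_s^{1/2})^{1/2} S_s^{-1/2}, and applies it independently inside small blocks of feature dimensions. The way dimensions are grouped (sorted by target variance and cut into chunks of `block_size`), the function names, and the regularization constants are all assumptions made for illustration; they are not taken from the paper or from the `alobashev/mkl-vc` repository. Plain arrays stand in for WavLM frame embeddings, and the vocoder step is omitted.

```python
import numpy as np


def _sqrtm_psd(mat, eps=1e-8):
    """Square root of a symmetric PSD matrix via eigendecomposition."""
    mat = (mat + mat.T) / 2.0  # remove rounding asymmetry
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, eps, None)
    return (vecs * np.sqrt(vals)) @ vecs.T


def mkl_map(src, tgt, eps=1e-6):
    """Closed-form linear OT map between Gaussian fits of src and tgt.

    src, tgt: arrays of shape (n_frames, dim).
    Returns (A, m_s, m_t) so that T(x) = m_t + A @ (x - m_s).
    """
    dim = src.shape[1]
    m_s, m_t = src.mean(axis=0), tgt.mean(axis=0)
    # eps * I keeps the covariances invertible when frames are scarce
    cov_s = np.atleast_2d(np.cov(src, rowvar=False)) + eps * np.eye(dim)
    cov_t = np.atleast_2d(np.cov(tgt, rowvar=False)) + eps * np.eye(dim)
    s_half = _sqrtm_psd(cov_s)
    s_inv_half = np.linalg.inv(s_half)
    A = s_inv_half @ _sqrtm_psd(s_half @ cov_t @ s_half) @ s_inv_half
    return A, m_s, m_t


def factorized_mkl_convert(src, tgt, block_size=2):
    """Apply a separate MKL map inside each block of dimensions.

    Assumption (illustrative only): dimensions are ordered by target
    variance and split into consecutive chunks of `block_size`.
    """
    order = np.argsort(tgt.var(axis=0))[::-1]
    out = np.empty_like(src, dtype=float)
    for start in range(0, src.shape[1], block_size):
        idx = order[start:start + block_size]
        A, m_s, m_t = mkl_map(src[:, idx], tgt[:, idx])
        out[:, idx] = (src[:, idx] - m_s) @ A.T + m_t
    return out
```

Working in small blocks means each covariance matrix is estimated from far fewer parameters than a full-dimensional one, which is one plausible reason a factorized map can stay stable when only a few seconds of reference frames are available. In a kNN-VC-style pipeline, `src` would hold source-utterance features and `tgt` the reference-speaker features, with the converted features passed on to a vocoder.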

Code & Models

Repositories

alobashev/mkl-vc
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing