Discrete Optimal Transport and Voice Conversion

Anton Selitskiy; Maitreya Kocharekar

arXiv:2505.04382·eess.AS·March 2, 2026

TL;DR

This paper introduces a novel voice conversion method using discrete optimal transport to align audio embeddings, achieving high-quality results and revealing an adversarial attack vulnerability in audio generation.

Contribution

The paper applies discrete optimal transport to voice conversion and examines its effect as a post-processing adversarial attack in audio synthesis.

Findings

01 High-quality voice conversion is achieved.

02 Applying discrete OT as a post-processing step can cause synthetic speech to be misclassified as real.

03 An ablation study examines the effect of the number of embeddings used.

Abstract

In this work, we address the task of voice conversion (VC) using a vector-based interface. To align audio embeddings across speakers, we employ discrete optimal transport (OT) and approximate the transport map using the barycentric projection. Our evaluation demonstrates that this approach yields high-quality and effective voice conversion. We also perform an ablation study on the number of embeddings used, extending previous work on simple averaging of kNN and OT results. Additionally, we show that applying discrete OT as a post-processing step in audio generation can cause synthetic speech to be misclassified as real, revealing a novel and strong adversarial attack.
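
The alignment step described in the abstract can be illustrated with a minimal sketch. Given source embeddings x_i and target embeddings y_j with a transport plan P, the barycentric projection maps each x_i to the P-weighted average of the targets, T(x_i) = (sum_j P_ij y_j) / (sum_j P_ij). The code below is an assumption-laden illustration, not the authors' implementation: it uses entropic-regularized Sinkhorn iterations as a stand-in for an exact discrete OT solver, assumes a squared Euclidean cost and uniform weights, and uses random vectors in place of real audio embeddings. The function names `sinkhorn_plan` and `barycentric_projection` are hypothetical.

```python
import numpy as np


def sinkhorn_plan(X, Y, reg=0.05, n_iters=500):
    """Approximate OT plan between point clouds X (n, d) and Y (m, d).

    Assumptions (not from the paper): uniform marginals, squared
    Euclidean cost, entropic regularization solved by Sinkhorn.
    """
    n, m = len(X), len(Y)
    a = np.full(n, 1.0 / n)  # uniform weights on source embeddings
    b = np.full(m, 1.0 / m)  # uniform weights on target embeddings
    # Pairwise squared Euclidean cost, rescaled to [0, 1] for stability.
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    C = C / max(C.max(), 1e-12)
    K = np.exp(-C / reg)
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)  # rescale to match column marginals
        u = a / (K @ v)    # rescale to match row marginals
    return u[:, None] * K * v[None, :]


def barycentric_projection(P, Y):
    """Map each source point to the plan-weighted mean of the targets."""
    return (P @ Y) / P.sum(axis=1, keepdims=True)


# Toy usage: random vectors stand in for speaker embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))           # "source speaker" embeddings
Y = rng.normal(loc=3.0, size=(8, 4))  # "target speaker" embeddings
P = sinkhorn_plan(X, Y)
X_conv = barycentric_projection(P, Y)
```

Each row of `X_conv` is a convex combination of rows of `Y`, so the converted embeddings stay inside the convex hull of the target embeddings. A smaller `reg` brings the plan closer to the unregularized OT solution at the cost of slower convergence.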

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders

Methods: ALIGN