Discrete Optimal Transport and Voice Conversion
Anton Selitskiy, Maitreya Kocharekar

TL;DR
This paper introduces a novel voice conversion method using discrete optimal transport to align audio embeddings, achieving high-quality results and revealing an adversarial attack vulnerability in audio generation.
Contribution
It applies discrete optimal transport for voice conversion and explores its effects as a post-processing adversarial attack in audio synthesis.
Findings
High-quality voice conversion achieved
Applying discrete OT as a post-processing step can cause synthetic speech to be misclassified as real
An ablation study varies the number of embeddings used, extending previous work on averaging kNN and OT results
Abstract
In this work, we address the task of voice conversion (VC) using a vector-based interface. To align audio embeddings across speakers, we employ discrete optimal transport (OT) and approximate the transport map using the barycentric projection. Our evaluation demonstrates that this approach yields high-quality and effective voice conversion. We also perform an ablation study on the number of embeddings used, extending previous work on simple averaging of kNN and OT results. Additionally, we show that applying discrete OT as a post-processing step in audio generation can cause synthetic speech to be misclassified as real, revealing a novel and strong adversarial attack.
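The alignment step described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the abstract does not say which OT solver is used, so this example swaps in entropic regularization (Sinkhorn iterations) to obtain an approximate transport plan, and it assumes uniform weights on both sets of embeddings and a squared Euclidean cost. The function names (`sinkhorn_plan`, `barycentric_projection`), the `reg` and `n_iter` parameters, and the toy 2-D "embeddings" are all hypothetical.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors given as lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sinkhorn_plan(source, target, reg=0.1, n_iter=500):
    """Approximate discrete OT plan between two point clouds.

    Assumptions (not taken from the paper): uniform marginals,
    squared Euclidean cost, entropic regularization via Sinkhorn.
    """
    n, m = len(source), len(target)
    cost = [[sq_dist(s, t) for t in target] for s in source]
    scale = max(max(row) for row in cost) or 1.0  # normalize cost for stability
    kernel = [[math.exp(-c / (reg * scale)) for c in row] for row in cost]
    a = [1.0 / n] * n  # uniform mass on source embeddings
    b = [1.0 / m] * m  # uniform mass on target embeddings
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def barycentric_projection(plan, target):
    """Map each source point to the plan-weighted average of target points."""
    dim = len(target[0])
    mapped = []
    for row in plan:
        mass = sum(row)  # total mass leaving this source point
        mapped.append(
            [sum(row[j] * target[j][d] for j in range(len(target))) / mass
             for d in range(dim)]
        )
    return mapped


# Toy example: two source "embeddings" and two target "embeddings".
source = [[0.0, 0.0], [1.0, 0.0]]
target = [[0.0, 1.0], [1.0, 1.0]]
plan = sinkhorn_plan(source, target)
mapped = barycentric_projection(plan, target)
print(mapped)
```

In this toy case each source point is carried almost entirely onto its nearest target point, so the projected vectors land close to the target cloud. In a voice conversion setting the projected embeddings would then be passed to whatever synthesis stage the system uses; that stage is outside the scope of this sketch.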
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders
Methods: ALIGN
