AdaptVC: High Quality Voice Conversion with Adaptive Learning
Jaehun Kim, Ji-Hoon Kim, Yeunju Choi, Tan Dat Nguyen, Seongkyu Mun, Joon Son Chung

TL;DR
AdaptVC introduces a novel voice conversion method that employs adaptive learning with self-supervised features and a conditional flow matching decoder, achieving high-quality, zero-shot voice conversion with improved robustness and similarity.
Contribution
The paper presents a new approach using adapters and a conditional flow matching decoder for disentangling content and speaker features in voice conversion, enhancing zero-shot performance.
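As a rough illustration of what an adapter over self-supervised features could look like, the sketch below mixes the per-layer outputs of a frozen encoder with learnable softmax weights, using one set of weights for content and another for speaker. This is an assumption made for illustration only: the summary does not state the adapter design, and the names `layer_weighted_adapter`, `content_logits`, and `speaker_logits` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def layer_weighted_adapter(layer_features, layer_logits):
    """Mix per-layer features with softmax weights (hypothetical adapter form).

    layer_features: list of L layers, each a list of T frames of D floats.
    layer_logits:   L scalars that would be learned during training.
    """
    weights = softmax(layer_logits)
    num_frames = len(layer_features[0])
    dim = len(layer_features[0][0])
    mixed = [[0.0] * dim for _ in range(num_frames)]
    for weight, layer in zip(weights, layer_features):
        for t in range(num_frames):
            for d in range(dim):
                mixed[t][d] += weight * layer[t][d]
    return mixed


# Toy input: 3 encoder layers, 2 frames, 2 feature dimensions.
features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
    [[3.0, 0.0], [0.0, 3.0]],
]
content_logits = [0.0, 0.0, 0.0]   # uniform mix across layers
speaker_logits = [5.0, 0.0, -5.0]  # mix dominated by the first layer

content_feats = layer_weighted_adapter(features, content_logits)
speaker_feats = layer_weighted_adapter(features, speaker_logits)
```

With separate weights per branch, each adapter can draw on different layers of the same encoder, which is one way the content and speaker streams could be kept apart before the decoder fuses them.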
Findings
Outperforms existing models in speech quality and similarity in zero-shot scenarios.
Achieves effective disentanglement of content and speaker features.
Demonstrates robustness and efficiency in voice conversion tasks.
Abstract
The goal of voice conversion is to transform the speech of a source speaker to sound like that of a reference speaker while preserving the original content. A key challenge is to extract disentangled linguistic content from the source and voice style from the reference. While existing approaches leverage various methods to isolate the two, generalization still requires further attention, especially for robustness in zero-shot scenarios. In this paper, we achieve successful disentanglement of content and speaker features by tuning self-supervised speech features with adapters. The adapters are trained to dynamically encode nuanced features from rich self-supervised features, and the decoder fuses them to produce speech that accurately resembles the reference with minimal loss of content. Moreover, we leverage a conditional flow matching decoder with cross-attention speaker conditioning…
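For readers unfamiliar with conditional flow matching, the sketch below shows the generic training target and an Euler sampler for the commonly used straight-line (optimal-transport) probability path. It is not the authors' implementation: `SIGMA_MIN`, `predict_velocity`, and the opaque `cond` argument (standing in for the content and speaker conditioning) are assumptions for illustration, and the cross-attention conditioning itself is not modeled.

```python
import random

SIGMA_MIN = 1e-4  # assumed small noise floor for the probability path


def cfm_training_pair(x1, t, rng):
    """Return (x_t, u_t): a point on the noise-to-data path and its velocity."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # noise sample
    x_t = [(1.0 - (1.0 - SIGMA_MIN) * t) * n + t * d for n, d in zip(x0, x1)]
    u_t = [d - (1.0 - SIGMA_MIN) * n for n, d in zip(x0, x1)]  # d(x_t)/dt
    return x_t, u_t


def cfm_loss(predict_velocity, x1, cond, rng):
    """Mean squared error between predicted and target velocity at a random t."""
    t = rng.random()
    x_t, u_t = cfm_training_pair(x1, t, rng)
    v = predict_velocity(x_t, t, cond)
    return sum((a - b) ** 2 for a, b in zip(v, u_t)) / len(x1)


def euler_sample(predict_velocity, cond, dim, steps, rng):
    """Integrate dx/dt = v(x, t, cond) from t=0 (noise) to t=1 with Euler steps."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical stand-in for a trained decoder: ignores its inputs and
# returns the conditioning vector as a constant velocity.
def toy_velocity(x_t, t, cond):
    return list(cond)


loss = cfm_loss(toy_velocity, [1.0, -2.0], [0.5, 0.5], random.Random(0))
sample = euler_sample(toy_velocity, [0.5, 0.5], dim=2, steps=10, rng=random.Random(1))
```

In a real system `predict_velocity` would be a neural network trained by minimizing `cfm_loss` over many samples, and `x1` would be a target acoustic representation such as a mel-spectrogram frame.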
Taxonomy
Topics: Advanced Data Compression Techniques · Speech Recognition and Synthesis · Speech and Audio Processing
