AdaptVC: High Quality Voice Conversion with Adaptive Learning

Jaehun Kim; Ji-Hoon Kim; Yeunju Choi; Tan Dat Nguyen; Seongkyu Mun; Joon Son Chung

arXiv:2501.01347 · cs.SD · January 15, 2025

Open Access

TL;DR

AdaptVC introduces a novel voice conversion method that employs adaptive learning with self-supervised features and a conditional flow matching decoder, achieving high-quality, zero-shot voice conversion with improved robustness and similarity.

Contribution

The paper presents a new approach using adapters and a conditional flow matching decoder for disentangling content and speaker features in voice conversion, enhancing zero-shot performance.

Findings

01 Outperforms existing models in speech quality and similarity in zero-shot scenarios.

02 Achieves effective disentanglement of content and speaker features.

03 Demonstrates robustness and efficiency in voice conversion tasks.

Abstract

The goal of voice conversion is to transform the speech of a source speaker to sound like that of a reference speaker while preserving the original content. A key challenge is to extract disentangled linguistic content from the source and voice style from the reference. While existing approaches leverage various methods to isolate the two, generalization still requires further attention, especially for robustness in zero-shot scenarios. In this paper, we achieve successful disentanglement of content and speaker features by tuning self-supervised speech features with adapters. The adapters are trained to dynamically encode nuanced features from rich self-supervised features, and the decoder fuses them to produce speech that accurately resembles the reference with minimal loss of content. Moreover, we leverage a conditional flow matching decoder with cross-attention speaker conditioning…
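The abstract names a conditional flow matching decoder but, as excerpted here, does not give its formulation. As a rough illustration only, the sketch below assumes the commonly used straight-line (optimal-transport) probability path from the flow matching literature, which may differ from what AdaptVC actually uses. Every name in it (`cfm_training_pair`, `cfm_loss`, `euler_sample`, `predict_velocity`, `sigma_min`) is hypothetical, and the cross-attention speaker conditioning is reduced to an opaque `cond` argument.

```python
import numpy as np

def cfm_training_pair(x1, rng, sigma_min=1e-4):
    """Build one training example for conditional flow matching.

    ASSUMPTION: straight-line (optimal-transport) path; not confirmed by the paper.
    x1 is the target feature array, e.g. mel frames of shape (T, D).
    Returns the sampled time t, the noisy point x_t, and the target velocity u_t.
    """
    x0 = rng.standard_normal(x1.shape)            # Gaussian noise sample
    t = rng.uniform()                             # time drawn from [0, 1)
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0             # d(x_t)/dt along the path
    return t, x_t, u_t

def cfm_loss(predict_velocity, x1, cond, rng):
    """Mean squared error between predicted and target velocity.

    predict_velocity(x_t, t, cond) is a hypothetical stand-in for the decoder;
    cond stands in for content and speaker conditioning.
    """
    t, x_t, u_t = cfm_training_pair(x1, rng)
    v = predict_velocity(x_t, t, cond)
    return float(np.mean((v - u_t) ** 2))

def euler_sample(predict_velocity, cond, shape, rng, steps=10):
    """Generate features by integrating the learned velocity field from t=0 to t=1."""
    x = rng.standard_normal(shape)                # start from noise
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * predict_velocity(x, i * dt, cond)
    return x
```

In a real system `predict_velocity` would be a neural network trained by minimizing `cfm_loss` over many batches; here it can be any callable with that signature, which keeps the sketch independent of a specific deep learning framework.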

Taxonomy

Topics: Advanced Data Compression Techniques · Speech Recognition and Synthesis · Speech and Audio Processing