vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders

Yiwei Guo; Zhihan Li; Junjie Li; Chenpeng Du; Hankun Wang; Shuai Wang; Xie Chen; Kai Yu

arXiv:2409.01995 · eess.AS · May 27, 2025

Open Access

TL;DR

vec2wav 2.0 is a discrete token vocoder for voice conversion that combines self-supervised speech models with an adaptive Snake activation function, achieving better audio quality and speaker similarity than baselines without requiring supervised training data.

Contribution

The paper presents vec2wav 2.0, a new vocoder that enhances voice conversion by incorporating timbre information through WavLM features and an adaptive activation, enabling high-quality, data-efficient, and cross-lingual VC.

Findings

01 Outperforms baselines in audio quality and speaker similarity

02 Effective in cross-lingual voice conversion with monolingual training

03 No supervised data needed for training

Abstract

We propose a new speech discrete token vocoder, vec2wav 2.0, which advances voice conversion (VC). We use discrete tokens from speech self-supervised models as the content features of source speech, and treat VC as a prompted vocoding task. To compensate for the loss of speaker timbre in the content tokens, vec2wav 2.0 uses WavLM features to provide strong timbre-dependent information. A novel adaptive Snake activation function is proposed to better incorporate timbre into the waveform reconstruction process. In this way, vec2wav 2.0 learns to alter the speaker timbre appropriately given different reference prompts. Moreover, no supervised data is required to train vec2wav 2.0 effectively. Experimental results demonstrate that vec2wav 2.0 outperforms all other baselines by a considerable margin in terms of audio quality and speaker similarity in any-to-any VC. Ablation studies…
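
The abstract names an adaptive Snake activation but does not give its formula. As a rough illustration only: the standard Snake activation is x + sin²(αx)/α, and one plausible way to make it timbre-adaptive is to predict α per channel from a timbre embedding. The sketch below assumes exactly that; the linear projection, the exponential used to keep α positive, and the names `adaptive_snake`, `timbre`, `w`, and `b` are hypothetical and not taken from the paper.

```python
import numpy as np


def snake(x, alpha):
    """Standard Snake activation: x + sin^2(alpha * x) / alpha."""
    return x + np.sin(alpha * x) ** 2 / alpha


def adaptive_snake(x, timbre, w, b):
    """Hypothetical timbre-conditioned Snake (an assumption, not the paper's definition).

    x:      (channels, time) hidden features inside the vocoder
    timbre: (d,) timbre embedding, e.g. pooled from reference-prompt features
    w, b:   (channels, d) and (channels,) parameters of an assumed linear projection
    """
    # Predict one alpha per channel from the timbre embedding.
    # exp() keeps alpha strictly positive, so the division is safe.
    alpha = np.exp(w @ timbre + b)[:, None]  # shape (channels, 1), broadcasts over time
    return snake(x, alpha)


# Usage: two different reference timbres change the activation's periodicity.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))
w = 0.1 * rng.standard_normal((4, 8))
b = np.zeros(4)
y_a = adaptive_snake(x, rng.standard_normal(8), w, b)
y_b = adaptive_snake(x, rng.standard_normal(8), w, b)
```

With a zero projection (`w = 0`, `b = 0`) this reduces to plain Snake with α = 1, so under this assumed design the conditioning only modulates the frequency of the periodic term.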

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing