Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations

Wen-Chin Huang; Yi-Chiao Wu; Tomoki Hayashi; Tomoki Toda

arXiv:2010.12231 · eess.AS · October 26, 2020 · 1 citation

TL;DR

This paper introduces a sequence-to-sequence voice conversion method built on self-supervised discrete speech representations. It converts speech from any speaker, including speakers unseen during training, to a fixed target speaker, and it needs only about 5 minutes of target-speaker data and no parallel dataset.

Contribution

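The approach uses vq-wav2vec (VQW2V) representations as the input to a seq2seq model that predicts the target speaker's acoustic features. Combined with pretraining and a postprocessing technique, it works with only 5 minutes of target-speaker data and even outperforms the same model trained with parallel data.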

Findings

01. Effective conversion with only 5 minutes of target-speaker data

02. Outperforms the same model trained with parallel data

03. Uses self-supervised discrete speech representations (vq-wav2vec) as features that are assumed to be speaker-independent

Abstract

We present a novel approach to any-to-one (A2O) voice conversion (VC) in a sequence-to-sequence (seq2seq) framework. A2O VC aims to convert any speaker, including those unseen during training, to a fixed target speaker. We utilize vq-wav2vec (VQW2V), a discretized self-supervised speech representation learned from massive unlabeled data, which is assumed to be speaker-independent and to correspond well to the underlying linguistic content. Given a training dataset of the target speaker, we extract VQW2V and acoustic features and estimate a seq2seq mapping function from the former to the latter. With the help of a pretraining method and a newly designed postprocessing technique, our model remains effective with only 5 min of training data, even outperforming the same model trained with parallel data.
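The data flow described in the abstract can be sketched in a few lines. This is a toy illustration only, not the authors' implementation: `quantize` is a nearest-codebook lookup standing in for VQW2V token extraction, and `build_training_pairs`, `convert`, `extract_acoustic`, and `token_to_acoustic` are hypothetical names introduced here. The point it tries to show is that training pairs come from the target speaker alone, while conversion tokenizes any source speaker and applies the learned mapping.

```python
from typing import Callable, List, Sequence, Tuple

Frame = Sequence[float]  # one frame of continuous features (toy stand-in)
Tokens = List[int]       # discrete unit indices, one per frame


def quantize(frames: Sequence[Frame], codebook: Sequence[Frame]) -> Tokens:
    """Toy discretizer (assumption, not VQW2V itself): map each frame
    to the index of its nearest codebook vector (squared Euclidean)."""
    def nearest(frame: Frame) -> int:
        return min(
            range(len(codebook)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(frame, codebook[k])),
        )
    return [nearest(f) for f in frames]


def build_training_pairs(
    target_utterances: Sequence[Sequence[Frame]],
    codebook: Sequence[Frame],
    extract_acoustic: Callable[[Sequence[Frame]], object],
) -> List[Tuple[Tokens, object]]:
    """Non-parallel training data: each pair holds the tokens and the
    acoustic features of the SAME target-speaker utterance."""
    return [(quantize(u, codebook), extract_acoustic(u)) for u in target_utterances]


def convert(
    source_frames: Sequence[Frame],
    codebook: Sequence[Frame],
    token_to_acoustic: Callable[[Tokens], object],
) -> object:
    """Conversion: tokenize speech from any source speaker, then apply
    the mapping that was estimated on target-speaker data."""
    return token_to_acoustic(quantize(source_frames, codebook))


# Two slightly different "speakers" saying the same content map to the
# same token sequence under this toy codebook.
codebook = [[0.0, 0.0], [1.0, 1.0]]
speaker_a = [[0.1, -0.1], [0.9, 1.2], [0.4, 0.4]]
speaker_b = [[0.2, 0.0], [1.1, 0.8], [0.3, 0.45]]
tokens_a = quantize(speaker_a, codebook)
tokens_b = quantize(speaker_b, codebook)
```

In the paper the mapping from tokens to acoustic features is a seq2seq model, so input and output lengths may differ; the callable `token_to_acoustic` here only marks where that model would sit.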

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing

Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Sequence to Sequence