Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations
Wen-Chin Huang, Yi-Chiao Wu, Tomoki Hayashi, Tomoki Toda

TL;DR
This paper introduces a sequence-to-sequence voice conversion method built on self-supervised discrete speech representations. It converts any speaker, including speakers unseen during training, to a fixed target speaker using as little as 5 minutes of target speaker data and no parallel dataset.
Contribution
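The approach feeds vq-wav2vec (VQW2V) representations into a seq2seq model that predicts the target speaker's acoustic features. Combined with a pretraining method and a newly designed postprocessing technique, it works with only 5 minutes of target speaker data and even outperforms the same model trained with parallel data.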
Findings
Conversion remains effective with only 5 minutes of target speaker data
Outperforms the same model trained with parallel data
Uses self-supervised discrete representations (vq-wav2vec), assumed to be speaker-independent, as the model input
Abstract
We present a novel approach to any-to-one (A2O) voice conversion (VC) in a sequence-to-sequence (seq2seq) framework. A2O VC aims to convert any speaker, including those unseen during training, to a fixed target speaker. We utilize vq-wav2vec (VQW2V), a discretized self-supervised speech representation that was learned from massive unlabeled data, which is assumed to be speaker-independent and well corresponds to underlying linguistic contents. Given a training dataset of the target speaker, we extract VQW2V and acoustic features to estimate a seq2seq mapping function from the former to the latter. With the help of a pretraining method and a newly designed postprocessing technique, our model can be generalized to only 5 min of data, even outperforming the same model trained with parallel data.
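The data flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_encoder` and `toy_acoustic_features` are hypothetical stand-ins for a pretrained vq-wav2vec encoder and an acoustic feature extractor, and the seq2seq model is replaced by a deliberately trivial per-code average lookup. The only point shown is that training pairs come from the target speaker alone, while conversion accepts a waveform from any speaker.

```python
from typing import Dict, List, Sequence, Tuple

Frame = List[float]  # one acoustic feature vector
Codes = List[int]    # discrete unit indices, one per frame in this toy setup


def toy_encoder(waveform: Sequence[float]) -> Codes:
    """Hypothetical stand-in for a pretrained discrete speech encoder.

    Quantizes each sample's magnitude into one of 4 bins. A real encoder
    would return learned codebook indices instead.
    """
    return [min(3, int(abs(x) * 4)) for x in waveform]


def toy_acoustic_features(waveform: Sequence[float]) -> List[Frame]:
    """Hypothetical stand-in for acoustic feature extraction (2-dim frames)."""
    return [[x, x * x] for x in waveform]


def build_pairs(target_utts: List[Sequence[float]]) -> List[Tuple[Codes, List[Frame]]]:
    """Build (codes, acoustic frames) pairs from target-speaker audio only."""
    return [(toy_encoder(w), toy_acoustic_features(w)) for w in target_utts]


def fit_lookup(pairs: List[Tuple[Codes, List[Frame]]]) -> Dict[int, Frame]:
    """Trivial stand-in for the seq2seq model: mean target frame per code."""
    sums: Dict[int, Frame] = {}
    counts: Dict[int, int] = {}
    for codes, frames in pairs:
        for code, frame in zip(codes, frames):
            if code not in sums:
                sums[code] = [0.0] * len(frame)
                counts[code] = 0
            sums[code] = [s + v for s, v in zip(sums[code], frame)]
            counts[code] += 1
    return {code: [s / counts[code] for s in total] for code, total in sums.items()}


def convert(source_waveform: Sequence[float], model: Dict[int, Frame]) -> List[Frame]:
    """Encode audio from any speaker, then map codes to target-speaker frames.

    Codes never seen during training are skipped in this toy version.
    """
    return [model[c] for c in toy_encoder(source_waveform) if c in model]


if __name__ == "__main__":
    target_utts = [[0.1, 0.6, 0.9], [0.2, 0.7]]  # made-up "target speaker" audio
    model = fit_lookup(build_pairs(target_utts))
    print(convert([-0.05, 0.95], model))          # made-up "source speaker" audio
```

Because the mapping is learned from codes to the target speaker's features, no source-speaker recordings and no parallel utterances appear anywhere in the training step.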
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Sequence to Sequence
