Rhythm-Flexible Voice Conversion without Parallel Data Using Cycle-GAN over Phoneme Posteriorgram Sequences

Cheng-chieh Yeh; Po-chun Hsu; Ju-chieh Chou; Hung-yi Lee; Lin-shan Lee

arXiv:1808.03113 · cs.SD · August 10, 2018 · 1 citation

TL;DR

This paper introduces a rhythm-flexible voice conversion method that combines an unsupervised Cycle-GAN with sequence-to-sequence modeling to transform phoneme posteriorgram sequences without parallel data, so the converted utterance no longer has to match the length of the input utterance.

Contribution

It proposes combining Cycle-GAN and sequence-to-sequence models for non-parallel, rhythm-flexible voice conversion. This avoids both the fixed output length imposed by earlier Cycle-GAN and VAE approaches and the reliance on parallel training data.

Findings

01. Encouraging results on two datasets.
02. Effective removal of length constraints.
03. Successful transformation of phoneme sequences without parallel data.

Abstract

Speaking rate refers to the average number of phonemes within some unit time, while the rhythmic patterns refer to duration distributions for realizations of different phonemes within different phonetic structures. Both are key components of prosody in speech, which is different for different speakers. Models like the cycle-consistent adversarial network (Cycle-GAN) and the variational auto-encoder (VAE) have been successfully applied to voice conversion tasks without parallel data. However, due to the neural network architectures and feature vectors chosen for these approaches, the length of the predicted utterance has to be fixed to that of the input utterance, which limits the flexibility in mimicking the speaking rates and rhythmic patterns for the target speaker. On the other hand, a sequence-to-sequence learning model was used to remove the above length constraint, but parallel training…
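
The two prosody quantities defined at the start of the abstract can be made concrete with a small sketch. The alignment data and the function names below are hypothetical and are not taken from the paper or its repository. The sketch also groups durations by phoneme identity only, ignoring the surrounding phonetic structure that the abstract's definition of rhythmic patterns includes.

```python
from collections import defaultdict

# Hypothetical forced alignment for one utterance:
# (phoneme, start time in seconds, end time in seconds).
alignment = [
    ("HH", 0.00, 0.08),
    ("AH", 0.08, 0.20),
    ("L", 0.20, 0.27),
    ("OW", 0.27, 0.50),
]


def speaking_rate(segments):
    """Average number of phonemes per second over the utterance."""
    total_time = segments[-1][2] - segments[0][1]
    return len(segments) / total_time


def durations_by_phoneme(segments):
    """Collect the observed durations (in seconds) for each phoneme.

    Simplification: realizations are grouped by phoneme identity only,
    not by the phonetic structure they occur in.
    """
    durations = defaultdict(list)
    for phoneme, start, end in segments:
        durations[phoneme].append(end - start)
    return dict(durations)


rate = speaking_rate(alignment)
rhythm = durations_by_phoneme(alignment)
```

A conversion model that fixes the output length to the input length cannot change either quantity, which is the limitation the abstract attributes to earlier Cycle-GAN and VAE approaches.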

Code & Models

Repositories

acetylSv/rhythmic-flexible-vc-arch
none · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing