Hierarchical Sequence to Sequence Voice Conversion with Limited Data

Praveen Narayanan; Punarjay Chakravarty; Francois Charette; Gint Puskorius

arXiv:1907.07769·eess.AS·July 19, 2019·5 cites

Open Access

TL;DR

This paper introduces a hierarchical, attention-based sequence-to-sequence method for parallel voice conversion. The network is first trained as an autoencoder on a large single-speaker dataset and then adapted to smaller multispeaker voice conversion datasets. It operates on mel spectrograms and uses a neural vocoder for synthesis.

Contribution

It proposes a novel hierarchical seq2seq architecture for voice conversion that effectively utilizes limited multi-speaker data by pretraining on single-speaker datasets.

Findings

01 Achieves effective voice conversion with limited multi-speaker data.

02 Uses mel spectrograms and attention-based models for improved quality.

03 Employs a neural vocoder for high-quality audio synthesis.

Abstract

We present a voice conversion solution using recurrent sequence to sequence modeling for DNNs. Our solution takes advantage of recent advances in attention based modeling in the fields of Neural Machine Translation (NMT), Text-to-Speech (TTS) and Automatic Speech Recognition (ASR). The problem consists of converting between voices in a parallel setting when <source, target> audio pairs are available. Our seq2seq architecture makes use of a hierarchical encoder to summarize input audio frames. On the decoder side, we use an attention based architecture used in recent TTS works. Since there is a dearth of large multispeaker voice conversion databases needed for training DNNs, we resort to training the network with a large single speaker dataset as an autoencoder. This is then adapted for the smaller multispeaker voice conversion datasets available for voice conversion. In…
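The abstract does not spell out how the hierarchical encoder summarizes audio frames. As a rough illustration only, the sketch below shows one common way such a hierarchy is built in pyramidal ASR encoders: adjacent frames are concatenated between levels, so each level sees a sequence half as long. This is an assumption for illustration, not the paper's confirmed design; the function names `stack_adjacent_frames` and `hierarchical_lengths` are hypothetical, and the recurrent layers that would sit between levels are omitted.

```python
from typing import List

Frame = List[float]  # one spectrogram frame, e.g. a vector of mel bins


def stack_adjacent_frames(frames: List[Frame], factor: int = 2) -> List[Frame]:
    """Concatenate every `factor` consecutive frames into one wider frame.

    Illustrative assumption: this mirrors the time reduction used in
    pyramidal encoders. Trailing frames that do not fill a complete
    group are dropped.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    usable = len(frames) - len(frames) % factor
    stacked: List[Frame] = []
    for start in range(0, usable, factor):
        merged: Frame = []
        for frame in frames[start:start + factor]:
            merged.extend(frame)
        stacked.append(merged)
    return stacked


def hierarchical_lengths(num_frames: int, num_levels: int, factor: int = 2) -> List[int]:
    """Sequence length seen at each level of the hierarchy (level 0 is the input)."""
    lengths = [num_frames]
    for _ in range(num_levels):
        lengths.append(lengths[-1] // factor)
    return lengths


if __name__ == "__main__":
    # Eight toy frames with three values each stand in for a mel spectrogram.
    mel = [[float(i)] * 3 for i in range(8)]
    level1 = stack_adjacent_frames(mel)
    level2 = stack_adjacent_frames(level1)
    print(len(mel), len(level1), len(level2))
    print(hierarchical_lengths(8, 2))
```

Shortening the sequence this way gives the attention mechanism on the decoder side fewer encoder steps to attend over, which is the usual motivation for hierarchical encoders on long audio inputs.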

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing

Methods: Mixture of Logistic Distributions · Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence · Dilated Causal Convolution · WaveNet