Hierarchical Sequence to Sequence Voice Conversion with Limited Data
Praveen Narayanan, Punarjay Chakravarty, Francois Charette, Gint Puskorius

TL;DR
This paper introduces a hierarchical sequence-to-sequence voice conversion method leveraging attention mechanisms, trained on large single-speaker datasets and adapted for multi-speaker scenarios, using mel spectrograms and a neural vocoder.
Contribution
It proposes a novel hierarchical seq2seq architecture for voice conversion that effectively utilizes limited multi-speaker data by pretraining on single-speaker datasets.
Findings
Achieves effective voice conversion with limited multi-speaker data.
Uses mel spectrograms and attention-based models for improved quality.
Employs a neural vocoder for high-quality audio synthesis.
Abstract
We present a voice conversion solution using recurrent sequence-to-sequence modeling for DNNs. Our solution takes advantage of recent advances in attention-based modeling in the fields of Neural Machine Translation (NMT), Text-to-Speech (TTS) and Automatic Speech Recognition (ASR). The problem consists of converting between voices in a parallel setting when (source, target) audio pairs are available. Our seq2seq architecture makes use of a hierarchical encoder to summarize input audio frames. On the decoder side, we use an attention-based architecture used in recent TTS works. Since there is a dearth of large multispeaker voice conversion databases needed for training DNNs, we resort to training the network with a large single-speaker dataset as an autoencoder. This is then adapted for the smaller multispeaker datasets available for voice conversion. In…
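The abstract says the hierarchical encoder summarizes input audio frames but does not spell out the mechanism here. The sketch below illustrates one common way such a summary can be built, assuming pyramid-style stacking of adjacent frames at each level. The function names, the 80-bin mel input, the reduction factor, and the random projection standing in for a trained recurrent layer are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def stack_adjacent_frames(frames: np.ndarray, factor: int = 2) -> np.ndarray:
    """Concatenate every `factor` consecutive frames into one wider frame."""
    num_frames, num_features = frames.shape
    # Zero-pad the tail so the frame count divides evenly by `factor`.
    pad = (-num_frames) % factor
    if pad:
        frames = np.pad(frames, ((0, pad), (0, 0)))
    return frames.reshape(-1, factor * num_features)


def hierarchical_summary(
    mel: np.ndarray,
    num_levels: int = 2,
    factor: int = 2,
    hidden_size: int = 64,
    seed: int = 0,
) -> np.ndarray:
    """Shorten a (frames, features) sequence by `factor` at each level."""
    rng = np.random.default_rng(seed)
    out = mel
    for _ in range(num_levels):
        out = stack_adjacent_frames(out, factor)
        # ASSUMPTION: a fixed random projection with tanh stands in for a
        # trained recurrent layer, purely to keep the sketch self-contained.
        scale = 1.0 / np.sqrt(out.shape[1])
        weights = rng.standard_normal((out.shape[1], hidden_size)) * scale
        out = np.tanh(out @ weights)
    return out


# Hypothetical input: 101 frames of an 80-bin mel spectrogram.
mel = np.random.default_rng(1).standard_normal((101, 80))
summary = hierarchical_summary(mel)
print(mel.shape, summary.shape)
```

With two levels and a factor of 2, the 101 input frames shrink to 26 summary vectors, so an attention-based decoder would have roughly a quarter as many encoder positions to attend over.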
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
Methods: Mixture of Logistic Distributions · Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence · Dilated Causal Convolution · WaveNet
