Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining

Wen-Chin Huang; Tomoki Hayashi; Yi-Chiao Wu; Hirokazu Kameoka; Tomoki Toda

arXiv:1912.06813·eess.AS·December 17, 2019·38 cites

TL;DR

This paper presents a Transformer-based sequence-to-sequence voice conversion model that leverages TTS pretraining to improve speech naturalness, intelligibility, and data efficiency, outperforming RNN-based models.

Contribution

It introduces a novel Transformer-based VC model with a TTS pretraining scheme, enhancing data efficiency and speech quality in voice conversion.

Findings

01 Transformer-based VC outperforms RNN-based models

02 Pretraining improves speech naturalness and intelligibility

03 Model requires less data for high-quality conversion

Abstract

We introduce a novel sequence-to-sequence (seq2seq) voice conversion (VC) model based on the Transformer architecture with text-to-speech (TTS) pretraining. Seq2seq VC models are attractive owing to their ability to convert prosody. While seq2seq models based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been successfully applied to VC, the use of the Transformer network, which has shown promising results in various speech processing tasks, has not yet been investigated. Nonetheless, their data-hungry property and the mispronunciation of converted speech make seq2seq models far from practical. To this end, we propose a simple yet effective pretraining technique to transfer knowledge from learned TTS models, which benefit from large-scale, easily accessible TTS corpora. VC models initialized with such pretrained model parameters are able to generate…
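
The abstract describes initializing a VC model with parameters from a pretrained TTS model. As a rough illustration of that idea only, the sketch below copies parameters by name from one parameter dictionary into another. It is not the authors' implementation: the function `init_from_pretrained`, the parameter names, the shapes, and the choice of which modules to transfer are all hypothetical, and plain Python lists stand in for tensors.

```python
from typing import Dict, List, Tuple

# Hypothetical parameter store: name -> (shape, values). A real model would
# hold tensors; flat lists keep this sketch dependency-free.
Param = Tuple[Tuple[int, ...], List[float]]


def init_from_pretrained(
    vc_params: Dict[str, Param],
    tts_params: Dict[str, Param],
    prefixes: Tuple[str, ...],
) -> List[str]:
    """Copy pretrained TTS parameters into a VC parameter dict, in place.

    Only names starting with one of `prefixes` are considered, and a
    parameter is copied only when the VC model has an entry with the same
    name and shape. Returns the sorted names that were copied.
    """
    copied = []
    for name, (shape, values) in tts_params.items():
        if not name.startswith(prefixes):
            continue  # module not selected for transfer
        if name in vc_params and vc_params[name][0] == shape:
            vc_params[name] = (shape, list(values))
            copied.append(name)
    return sorted(copied)


# Toy example; every name and shape here is made up for illustration.
tts = {
    "decoder.proj.weight": ((2, 2), [0.1, 0.2, 0.3, 0.4]),
    "text_encoder.embed.weight": ((3, 2), [0.0] * 6),
}
vc = {
    "decoder.proj.weight": ((2, 2), [0.0, 0.0, 0.0, 0.0]),
    "encoder.proj.weight": ((2, 2), [0.0, 0.0, 0.0, 0.0]),
}
copied = init_from_pretrained(vc, tts, prefixes=("decoder.",))
```

In this toy setup only the decoder entry is transferred; the VC encoder keeps its initial values and would be learned during VC training. Which modules to transfer, and in what order, is a design choice the abstract does not spell out.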

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Sigmoid Activation · Tanh Activation · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing