ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion

Hirokazu Kameoka; Kou Tanaka; Damian Kwasny; Takuhiro Kaneko; Nobukatsu Hojo

arXiv:1811.01609 · cs.SD · October 8, 2020 · 20 cites

TL;DR

ConvS2S-VC is a fully convolutional sequence-to-sequence model for voice conversion. A single model performs many-to-many speaker conversion and modifies the pitch contour and duration of the input speech along with its voice characteristics.

Contribution

The paper presents a fully convolutional seq2seq architecture that uses conditional batch normalization to perform flexible multi-speaker voice conversion with a single model.
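As a rough illustration of what conditional batch normalization means here, the sketch below normalizes each channel over the frames and then applies a scale and shift selected by a speaker identity. It is a minimal sketch of the general technique, not the paper's implementation: the function name, the `gamma`/`beta` lookup tables, the speaker keys, and the toy values are all assumptions made for illustration.

```python
import math


def conditional_batch_norm(x, speaker_id, gamma, beta, eps=1e-5):
    """Normalize each channel over frames, then apply a speaker-specific affine map.

    x: list of frames, each a list of channel values (frames x channels).
    gamma, beta: dicts mapping a speaker id to per-channel scale / shift lists
                 (hypothetical lookup tables; a real model would learn them).
    """
    n_frames = len(x)
    n_channels = len(x[0])
    out = [[0.0] * n_channels for _ in range(n_frames)]
    for c in range(n_channels):
        column = [frame[c] for frame in x]
        mean = sum(column) / n_frames
        var = sum((v - mean) ** 2 for v in column) / n_frames
        inv_std = 1.0 / math.sqrt(var + eps)
        for t in range(n_frames):
            normalized = (column[t] - mean) * inv_std
            # The only speaker-dependent part: which scale and shift are used.
            out[t][c] = gamma[speaker_id][c] * normalized + beta[speaker_id][c]
    return out


# Toy data (assumed values): two frames, two channels, two speakers.
frames = [[1.0, 10.0], [3.0, 30.0]]
gamma = {"spk_a": [1.0, 1.0], "spk_b": [2.0, 0.5]}
beta = {"spk_a": [0.0, 0.0], "spk_b": [1.0, -1.0]}

out_a = conditional_batch_norm(frames, "spk_a", gamma, beta)
out_b = conditional_batch_norm(frames, "spk_b", gamma, beta)
```

The same normalized features produce different outputs for `spk_a` and `spk_b`, which is how one set of shared convolutional weights can be steered toward different target speakers.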

Findings

01. Higher sound quality and speaker similarity than baseline methods.

02. Effective in emotional expression, accent conversion, and speech enhancement.

03. Handles any-to-many conversion without source speaker info.

Abstract

This paper proposes a voice conversion (VC) method using sequence-to-sequence (seq2seq or S2S) learning, which flexibly converts not only the voice characteristics but also the pitch contour and duration of input speech. The proposed method, called ConvS2S-VC, has three key features. First, it uses a model with a fully convolutional architecture. This is particularly advantageous in that it is suitable for parallel computations using GPUs. It is also beneficial since it enables effective normalization techniques such as batch normalization to be used for all the hidden layers in the networks. Second, it achieves many-to-many conversion by simultaneously learning mappings among multiple speakers using only a single model instead of separately learning mappings between each speaker pair using a different model. This enables the model to fully utilize available training data collected from…
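The claim that a convolutional architecture suits parallel computation can be illustrated with a single-channel dilated causal convolution, one of the building blocks commonly used in convolutional seq2seq models. This is a minimal sketch under assumed names and toy values, not the layer definition from the paper: each output frame depends only on input frames, never on earlier outputs, so all frames could be computed independently, unlike a recurrent step.

```python
def dilated_causal_conv1d(x, kernel, dilation=1):
    """Compute y[t] = sum_k kernel[k] * x[t - k * dilation].

    Inputs before the start of the sequence are treated as zero, so y[t]
    never looks at x[t + 1] or later (causality).
    """
    y = []
    for t in range(len(x)):  # each t is independent of every other t
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y


# Toy input (assumed values): a two-tap averaging kernel with dilation 2.
signal = [1.0, 2.0, 3.0, 4.0, 5.0]
y = dilated_causal_conv1d(signal, kernel=[0.5, 0.5], dilation=2)

# Changing a future sample leaves earlier outputs untouched.
changed = signal[:4] + [100.0]
y_changed = dilated_causal_conv1d(changed, kernel=[0.5, 0.5], dilation=2)
```

The loop over `t` is written sequentially for readability; because no iteration reads another iteration's result, a GPU implementation can evaluate all frames at once.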

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence · Dilated Causal Convolution · Causal Convolution · Convolution