ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion
Hirokazu Kameoka, Kou Tanaka, Damian Kwasny, Takuhiro Kaneko, Nobukatsu Hojo

TL;DR
ConvS2S-VC is a fully convolutional sequence-to-sequence model for voice conversion. A single model handles many-to-many speaker conversion and can modify the pitch contour and duration of the input speech as well as its voice characteristics.
Contribution
It presents a novel fully convolutional seq2seq architecture with conditional batch normalization for flexible, multi-speaker voice conversion in a single model.
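The summary above does not spell out how conditional batch normalization works. A minimal sketch of one common formulation follows, assuming that each channel is normalized with shared batch statistics and that only the learned scale and shift are selected by the target speaker index. The function name, tensor layout, and parameter shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def conditional_batch_norm(x, speaker_ids, gamma, beta, eps=1e-5):
    """Hypothetical helper (not from the paper).

    x:           (batch, channels, time) hidden features
    speaker_ids: (batch,) integer index of the target speaker per utterance
    gamma, beta: (num_speakers, channels) per-speaker scale and shift
    """
    # Normalize each channel with statistics pooled over batch and time.
    mean = x.mean(axis=(0, 2), keepdims=True)
    var = x.var(axis=(0, 2), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Look up the scale and shift that belong to each utterance's speaker.
    g = gamma[speaker_ids][:, :, None]  # (batch, channels, 1)
    b = beta[speaker_ids][:, :, None]
    return g * x_hat + b

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 50))          # 4 utterances, 8 channels, 50 frames
speaker_ids = np.array([0, 1, 2, 1])     # assumed target speaker per utterance
gamma = rng.normal(size=(3, 8))          # 3 speakers in this toy setup
beta = rng.normal(size=(3, 8))
y = conditional_batch_norm(x, speaker_ids, gamma, beta)
```

Under this assumption, the convolutional weights are shared by all speakers and only the small per-speaker scale and shift tables differ, which is one way a single model can serve many speaker pairs.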
Findings
Higher sound quality and speaker similarity than baseline methods.
Effective in emotional expression, accent conversion, and speech enhancement.
Handles any-to-many conversion without source speaker info.
Abstract
This paper proposes a voice conversion (VC) method using sequence-to-sequence (seq2seq or S2S) learning, which flexibly converts not only the voice characteristics but also the pitch contour and duration of input speech. The proposed method, called ConvS2S-VC, has three key features. First, it uses a model with a fully convolutional architecture. This is particularly advantageous in that it is suitable for parallel computations using GPUs. It is also beneficial since it enables effective normalization techniques such as batch normalization to be used for all the hidden layers in the networks. Second, it achieves many-to-many conversion by simultaneously learning mappings among multiple speakers using only a single model instead of separately learning mappings between each speaker pair using a different model. This enables the model to fully utilize available training data collected from…
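To make the idea of a fully convolutional layer concrete, here is a rough sketch of a dilated causal 1-D convolution followed by a sigmoid gate, two of the building blocks listed for this paper. The layer sizes, the gating form, and both function names are assumptions chosen for illustration; the truncated abstract does not specify them.

```python
import numpy as np

def causal_dilated_conv1d(x, w, dilation=1):
    """Hypothetical helper. x: (in_channels, time); w: (out_channels, in_channels, kernel)."""
    out_ch, in_ch, k = w.shape
    pad = (k - 1) * dilation
    # Left-pad only, so output frame t never sees input frames after t.
    xp = np.pad(x, ((0, 0), (pad, 0)))
    T = x.shape[1]
    y = np.zeros((out_ch, T))
    for j in range(k):
        # Tap j reads the input delayed by (k - 1 - j) * dilation frames.
        y += w[:, :, j] @ xp[:, j * dilation : j * dilation + T]
    return y

def gated_linear_unit(h):
    """Split channels in half and gate one half with the sigmoid of the other."""
    a, b = np.split(h, 2, axis=0)
    return a * (1.0 / (1.0 + np.exp(-b)))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 20))     # 4 feature channels, 20 frames
w = rng.normal(size=(6, 4, 3))   # 6 output channels, kernel width 3
h = gated_linear_unit(causal_dilated_conv1d(x, w, dilation=2))
```

Every output frame depends only on a fixed window of input frames, so all frames can be computed at once during training, unlike a recurrent layer that must step through time.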
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence · Dilated Causal Convolution · Causal Convolution · Convolution
