Towards end-to-end F0 voice conversion based on Dual-GAN with convolutional wavelet kernels
Clément Le Moine Veillon, Nicolas Obin, Axel Roebel

TL;DR
This paper introduces an end-to-end neural network framework utilizing Dual-GAN and convolutional wavelet kernels for expressive F0 voice conversion, enabling efficient multi-scale F0 representation and emotion transformation.
Contribution
It proposes a novel end-to-end F0 transformation model combining wavelet-based feature extraction with adversarial learning for emotion conversion.
Findings
Effective multi-scale F0 encoding achieved
Successful emotion-to-emotion F0 transformation demonstrated
End-to-end training improves F0 conversion quality
Abstract
This paper presents an end-to-end framework for F0 transformation in the context of expressive voice conversion. A single neural network is proposed, in which a first module learns an F0 representation over different temporal scales and a second, adversarial module learns the transformation from one emotion to another. The first module is composed of a convolution layer with wavelet kernels, so that the various temporal scales of F0 variations can be efficiently encoded. Combining decomposition and transformation in a single network makes it possible to learn, end to end and directly from the raw F0 signal, the F0 decomposition that is optimal with respect to the transformation.
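The multi-scale decomposition described in the abstract can be illustrated with a minimal NumPy sketch of a convolution layer whose kernels are wavelets at several scales. The Mexican-hat (Ricker) wavelet, the dyadic scales, the kernel length, and the synthetic F0 contour are all assumptions made for illustration; the abstract does not state which wavelet or scales the authors use. The kernels here are also fixed, whereas the paper trains the decomposition jointly with the transformation.

```python
import numpy as np

def ricker_kernel(scale, width_factor=10):
    """Sampled Mexican-hat (Ricker) wavelet at a given scale (assumed wavelet family)."""
    # Odd kernel length proportional to the scale, so the kernel has a centre tap.
    width = int(width_factor * scale) | 1
    t = (np.arange(width) - width // 2) / scale
    k = (1.0 - t ** 2) * np.exp(-0.5 * t ** 2)
    k -= k.mean()                  # enforce zero mean after truncation
    return k / np.linalg.norm(k)   # unit energy, so scales are comparable

def wavelet_conv_layer(f0, scales):
    """Convolve a 1-D F0 contour with one wavelet kernel per scale.

    Returns an array of shape (len(scales), len(f0)).
    """
    kernels = [ricker_kernel(s) for s in scales]
    # With mode="same", the output keeps len(f0) only if no kernel is longer.
    assert max(len(k) for k in kernels) <= len(f0)
    return np.stack([np.convolve(f0, k, mode="same") for k in kernels])

# Hypothetical F0 contour in Hz: a slow phrase-level movement plus a faster ripple.
frames = np.arange(400)
f0 = (200.0
      + 20.0 * np.sin(2 * np.pi * frames / 200)
      + 5.0 * np.sin(2 * np.pi * frames / 20))

scales = [1, 2, 4, 8, 16]          # assumed dyadic scales, in frames
features = wavelet_conv_layer(f0, scales)
print(features.shape)              # (5, 400)
```

Each output row responds to F0 variations at one temporal scale, and the zero-mean kernels ignore the constant pitch offset away from the edges. In a trainable version, the kernels would be layer weights updated together with the adversarial module, which is the end-to-end property the abstract emphasizes.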
Taxonomy
Methods: Convolution
