Fast-VGAN: Lightweight Voice Conversion with Explicit Control of F0 and Duration Parameters
Mathilde Abrassart, Nicolas Obin, Axel Roebel

TL;DR
Fast-VGAN introduces a lightweight neural network for voice conversion that allows explicit, intuitive control over pitch, duration, and speaker identity, enabling flexible and high-quality speech transformations.
Contribution
The paper presents a convolutional neural network that explicitly conditions on speech parameters for controllable voice conversion without disentanglement techniques.
Findings
Achieves high intelligibility and speaker similarity.
Provides flexible control over F0, duration, and speaker identity.
Performs well on speaker conversion and expressive speech tasks.
Abstract
Precise control over speech characteristics, such as pitch, duration, and speech rate, remains a significant challenge in the field of voice conversion. The ability to manipulate parameters like pitch and syllable rate is an important element for effective identity conversion, but can also be used independently for voice transformation, achieving goals that were historically addressed by vocoder-based methods. In this work, we explore a convolutional neural network-based approach that aims to provide means for modifying fundamental frequency (F0), phoneme sequences, intensity, and speaker identity. Rather than relying on disentanglement techniques, our model is explicitly conditioned on these factors to generate mel spectrograms, which are then converted into waveforms using a universal neural vocoder. Accordingly, during inference, F0 contours, phoneme sequences, and speaker…
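Because the model is conditioned explicitly on F0 and on a phoneme sequence, voice transformation at inference amounts to editing those conditioning tracks before generation. The following is a minimal sketch of two such edits, assuming frame-level conditioning tracks held in plain Python lists. The function names `shift_f0` and `stretch_frames`, the 0 Hz convention for unvoiced frames, and the nearest-neighbour resampling are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shift_f0(f0_hz: Sequence[float], semitones: float) -> List[float]:
    """Transpose an F0 contour by a number of semitones.

    Assumption: unvoiced frames are marked with 0 Hz and are left unchanged.
    """
    ratio = 2.0 ** (semitones / 12.0)  # 12 semitones = one octave = factor 2
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


def stretch_frames(frames: Sequence[T], factor: float) -> List[T]:
    """Change duration by resampling a frame-level track (nearest neighbour).

    factor > 1 slows speech down (more frames), factor < 1 speeds it up.
    """
    if not frames:
        return []
    n_out = max(1, round(len(frames) * factor))
    last = len(frames) - 1
    return [frames[min(last, int(i / factor))] for i in range(n_out)]


# Hypothetical frame-aligned conditioning tracks for a short utterance.
f0 = [100.0, 0.0, 200.0, 210.0]
phonemes = ["a", "a", "b", "b"]

# Raise pitch by one octave, then double the duration of both tracks
# so that they stay frame-aligned.
f0_edit = stretch_frames(shift_f0(f0, 12), 2.0)
phonemes_edit = stretch_frames(phonemes, 2.0)
```

In a system of this kind, the edited tracks, together with a speaker identity, would then be passed to the generator in place of the originals. How the paper's model actually encodes or aligns these inputs is not specified in the text above.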
