FastVC: Fast Voice Conversion with non-parallel data
Oriol Barbany Mayor, Milos Cernak

TL;DR
FastVC is an efficient, end-to-end voice conversion model that operates on non-parallel data, producing natural-sounding speech across multiple speakers with minimal resource requirements.
Contribution
The paper introduces a conditional AutoEncoder model for fast voice conversion, trained on non-parallel data, that outperforms the VC Challenge 2020 baselines in naturalness on the cross-lingual task.
Findings
Outperforms the VC Challenge 2020 baselines in naturalness on the cross-lingual task
Operates on non-parallel data without annotations
Balances resource efficiency with speech quality
Abstract
This paper introduces FastVC, an end-to-end model for fast Voice Conversion (VC). The proposed model can convert speech of arbitrary length from multiple source speakers to multiple target speakers. FastVC is based on a conditional AutoEncoder (AE) trained on non-parallel data and requires no annotations at all. The model's latent representation is shown to be speaker-independent and similar to phonemes, which is a desirable property for VC systems. While current VC systems focus primarily on achieving the highest overall speech quality, this paper tries to balance speech quality against the resources needed to run the system. Despite its simple structure, the proposed model outperforms the VC Challenge 2020 baselines on the cross-lingual task in terms of naturalness.
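The conditional AutoEncoder idea in the abstract can be illustrated with a small sketch. This is not the FastVC architecture: the layer shapes, the single linear encoder and decoder, the one-hot speaker code, and every name below (`encode`, `decode`, `convert`, `N_FEATS`, and so on) are assumptions made for illustration, and the weights are random rather than trained. It only shows the general pattern: the encoder sees no speaker label, the decoder is conditioned on one, and conversion amounts to decoding with a different speaker identity.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes are hypothetical and chosen only for this sketch.
N_FEATS = 80     # features per input frame
N_LATENT = 16    # size of the latent code per frame
N_SPEAKERS = 4   # number of known speakers

# Random, untrained weights standing in for learned parameters.
W_enc = rng.normal(scale=0.1, size=(N_FEATS, N_LATENT))
W_dec = rng.normal(scale=0.1, size=(N_LATENT + N_SPEAKERS, N_FEATS))


def encode(frames):
    """Map (n_frames, N_FEATS) to a latent code; no speaker label is used."""
    return np.tanh(frames @ W_enc)


def decode(latent, speaker_id):
    """Reconstruct frames from the latent code plus a one-hot speaker code."""
    one_hot = np.zeros((latent.shape[0], N_SPEAKERS))
    one_hot[:, speaker_id] = 1.0
    return np.concatenate([latent, one_hot], axis=1) @ W_dec


def reconstruction_loss(frames, speaker_id):
    """Training objective on non-parallel data: reconstruct the speaker's own speech."""
    return float(np.mean((decode(encode(frames), speaker_id) - frames) ** 2))


def convert(frames, target_speaker_id):
    """Conversion: keep the latent content, swap the speaker condition."""
    return decode(encode(frames), target_speaker_id)


# Frame-wise processing means the input can have any number of frames.
source = rng.normal(size=(120, N_FEATS))
converted = convert(source, target_speaker_id=2)
```

Because training only needs each utterance paired with its own speaker identity, no parallel recordings or transcriptions are required in this setup. Conversion quality then depends on how little speaker information leaks into the latent code.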
