FastVC: Fast Voice Conversion with non-parallel data

Oriol Barbany Mayor; Milos Cernak

arXiv:2010.04185·eess.AS·May 7, 2021

TL;DR

FastVC is an efficient, end-to-end voice conversion model trained on non-parallel data. It produces natural-sounding speech across multiple speakers while keeping resource requirements low.

Contribution

The paper introduces a conditional AutoEncoder, trained on non-parallel data, for fast voice conversion. The model outperforms the VC Challenge 2020 baselines in naturalness on the cross-lingual task.

Findings

01. Outperforms VC Challenge 2020 baselines in naturalness

02. Operates on non-parallel data without annotations

03. Balances resource efficiency with speech quality

Abstract

This paper introduces FastVC, an end-to-end model for fast Voice Conversion (VC). The proposed model can convert speech of arbitrary length from multiple source speakers to multiple target speakers. FastVC is based on a conditional AutoEncoder (AE) trained on non-parallel data and requires no annotations. The model's latent representation is shown to be speaker-independent and similar to phonemes, which is a desirable property for VC systems. While current VC systems focus primarily on achieving the highest overall speech quality, this paper tries to balance speech quality against the resources needed to run the system. Despite its simple structure, the proposed model outperforms the VC Challenge 2020 baselines on the cross-lingual task in terms of naturalness.
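
To make the conditional AutoEncoder idea concrete, here is a minimal, untrained sketch of how such a model can swap speaker identity at conversion time. It is not the FastVC architecture, which this page does not describe: the class name `ConditionalAutoEncoder`, the linear layers, the one-hot speaker code, the feature sizes, and the frame-by-frame processing are all assumptions chosen for illustration.

```python
import random


def linear(weights, x):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class ConditionalAutoEncoder:
    """Hypothetical toy model: linear encoder, speaker-conditioned linear decoder."""

    def __init__(self, n_features, n_latent, n_speakers, seed=0):
        rng = random.Random(seed)
        self.n_speakers = n_speakers
        # The encoder never sees the speaker identity.
        self.enc = random_matrix(n_latent, n_features, rng)
        # The decoder sees the latent code concatenated with a one-hot speaker id.
        self.dec = random_matrix(n_features, n_latent + n_speakers, rng)

    def encode(self, frame):
        return linear(self.enc, frame)

    def decode(self, latent, speaker_id):
        one_hot = [1.0 if i == speaker_id else 0.0 for i in range(self.n_speakers)]
        return linear(self.dec, latent + one_hot)

    def convert(self, frames, target_speaker):
        # Frames are processed independently, so any utterance length works.
        return [self.decode(self.encode(f), target_speaker) for f in frames]


model = ConditionalAutoEncoder(n_features=4, n_latent=2, n_speakers=3)
utterance = [[0.1, 0.2, 0.3, 0.4] for _ in range(5)]  # five made-up feature frames
converted = model.convert(utterance, target_speaker=1)
```

In a setup like this, training would reconstruct each utterance using its own speaker id, so no paired recordings of different speakers saying the same sentence are needed. Conversion only replaces the speaker id passed to the decoder, which works best when the latent code carries little speaker information.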
