FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention

Yist Y. Lin; Chung-Ming Chien; Jheng-Hao Lin; Hung-yi Lee; Lin-shan Lee

arXiv:2010.14150·eess.AS·May 4, 2021

TL;DR

FragmentVC is an end-to-end voice conversion method that uses attention to extract fine-grained voice fragments from target-speaker utterances and fuse them into the converted utterance. It supports any-to-any conversion, including speakers unseen during training, and requires no parallel data.

Contribution

It proposes a novel approach combining Wav2Vec 2.0 and Transformer attention to perform any-to-any voice conversion without disentanglement or parallel data.

Findings

01. Outperforms SOTA methods like AdaIN-VC and AutoVC in objective and subjective evaluations.

02. Effectively extracts and fuses voice fragments for high-quality conversion.

03. Operates end-to-end with only reconstruction loss, no disentanglement needed.
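To make the "only reconstruction loss" finding concrete, here is a minimal sketch of a reconstruction loss between a predicted and a reference mel-spectrogram. The choice of a mean absolute (L1) distance is an assumption for illustration, since this page does not say which distance the authors use; the function name `l1_reconstruction_loss` and the toy values are likewise hypothetical.

```python
def l1_reconstruction_loss(predicted, reference):
    """Mean absolute error between two spectrograms given as lists of frames.

    Assumption: an L1 distance is used here purely for illustration; the
    source text only says that a reconstruction loss is the sole objective.
    """
    if len(predicted) != len(reference):
        raise ValueError("spectrograms must have the same number of frames")
    total, count = 0.0, 0
    for pred_frame, ref_frame in zip(predicted, reference):
        if len(pred_frame) != len(ref_frame):
            raise ValueError("frames must have the same number of mel bins")
        for p, r in zip(pred_frame, ref_frame):
            total += abs(p - r)
            count += 1
    if count == 0:
        raise ValueError("spectrograms must not be empty")
    return total / count


# Toy example: one frame with two mel bins (hypothetical values).
loss = l1_reconstruction_loss([[1.0, 2.0]], [[0.0, 4.0]])
```

Because the objective compares only the output spectrogram with the reference, no additional term for separating speaker and content information appears in this sketch, which matches the "no disentanglement needed" claim above.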

Abstract

Any-to-any voice conversion aims to convert the voice from and to any speakers, even those unseen during training. This is much more challenging than one-to-one or many-to-many tasks, but much more attractive in real-world scenarios. In this paper we propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0, while the spectral features of the utterance(s) from the target speaker are obtained from log mel-spectrograms. By aligning the hidden structures of the two different feature spaces with a two-stage training process, FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance. All of this is based on the attention mechanism of the Transformer, as verified by analysis of attention maps, and is accomplished end-to-end. This approach is trained with…
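The extract-and-fuse step described in the abstract can be illustrated with plain scaled dot-product cross-attention: source-side content frames act as queries, and target-speaker frames supply the keys and values. The sketch below is a simplified illustration under stated assumptions, not the authors' implementation. It uses a single head with no learned projections, and the names `cross_attention`, `source_content`, `target_keys`, and `target_mels` and all numbers are hypothetical toy stand-ins for Wav2Vec 2.0 features and log mel-spectrogram frames.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    Each query frame attends over every key frame; the output frame is the
    attention-weighted average of the corresponding value frames.
    Assumption: queries and keys already live in the same feature space.
    """
    dim = len(keys[0])
    outputs, attention_maps = [], []
    for query in queries:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        weights = softmax(scores)
        fused = [
            sum(w * value[j] for w, value in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(fused)
        attention_maps.append(weights)
    return outputs, attention_maps


# Hypothetical toy data: 2 source frames, 3 target frames, 3 mel bins.
source_content = [[4.0, 0.0], [0.0, 4.0]]
target_keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
target_mels = [[0.1, 0.2, 0.3], [0.9, 0.8, 0.7], [0.5, 0.5, 0.5]]

converted, attention = cross_attention(source_content, target_keys, target_mels)
```

Each row of `attention` shows which target frames a source frame draws from, which is the kind of attention map the abstract says was analyzed. Each row of `converted` is a weighted mix of target frames, so the output is built entirely from the target speaker's spectral material.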

Code & Models

Models

🤗
fierce-cats/beatrice-trainer
model · ♡ 39

Taxonomy

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Multi-Head Attention · Layer Normalization · Byte Pair Encoding · Softmax · Adam · Dense Connections