Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling

Songxiang Liu; Yuewen Cao; Disong Wang; Xixin Wu; Xunying Liu; Helen Meng

arXiv:2009.02725 · eess.AS · May 25, 2021

TL;DR

This paper introduces an any-to-many voice conversion method using a sequence-to-sequence model with location-relative attention, leveraging text supervision and a bottleneck feature extractor for improved naturalness and speaker similarity.

Contribution

It proposes a novel any-to-many voice conversion framework combining a phoneme recognizer, bottleneck features, and a location-relative attention seq2seq model, enabling high-quality non-parallel voice conversion.

Findings

01 Superior naturalness and speaker similarity in voice conversion results

02 Effective alignment of long sequences using MoL attention and down-sampling

03 The approach can be extended to any-to-any voice conversion with high performance
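
The second finding refers to mixture-of-logistics (MoL) attention, a location-relative mechanism in which the attention window can only move forward along the input. The exact parameterization used in the paper is not given on this page, so the following is only a generic sketch of one MoL attention step in plain Python. The function name `mol_attention_step`, its arguments, and the small floor added to the scales are assumptions made for illustration; in a trained model the raw parameters would be predicted by the decoder rather than passed in by hand.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softplus(x):
    # Stable softplus; the result is always positive.
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def mol_attention_step(prev_means, raw_deltas, raw_scales, raw_weights, num_frames):
    """One decoder step of a generic MoL attention (illustrative, hypothetical API).

    Returns the alignment weights over `num_frames` input frames and the
    updated component means to carry into the next decoder step.
    """
    # Location-relative update: each mean advances by a positive increment,
    # so the attention window never moves backwards.
    means = [m + softplus(d) for m, d in zip(prev_means, raw_deltas)]
    # Positive scales; the 1e-3 floor is an assumption to avoid division by zero.
    scales = [softplus(s) + 1e-3 for s in raw_scales]
    # Softmax over the raw mixture weights.
    top = max(raw_weights)
    exps = [math.exp(w - top) for w in raw_weights]
    total = sum(exps)
    weights = [e / total for e in exps]

    align = []
    for j in range(num_frames):
        # Probability mass of each logistic component on the interval
        # [j - 0.5, j + 0.5], computed as a difference of logistic CDFs.
        mass = 0.0
        for w, mu, s in zip(weights, means, scales):
            mass += w * (sigmoid((j + 0.5 - mu) / s) - sigmoid((j - 0.5 - mu) / s))
        align.append(mass)
    return align, means


if __name__ == "__main__":
    # Two mixture components with hand-picked raw parameters (assumed values).
    means = [0.0, 0.0]
    for step in range(3):
        align, means = mol_attention_step(means, [0.0, 1.0], [0.0, 0.0], [0.0, 0.0], 20)
        peak = max(range(len(align)), key=align.__getitem__)
        print(step, peak, [round(m, 3) for m in means])
```

Because the means are accumulated from positive increments rather than predicted as absolute positions, the alignment depends only on relative movement along the input, which is one common explanation for why this family of attention mechanisms copes well with long sequences.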

Abstract

This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach, which utilizes text supervision during training. In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module. During the training stage, an encoder-decoder-based hybrid connectionist-temporal-classification-attention (CTC-attention) phoneme recognizer is trained, whose encoder has a bottle-neck layer. A BNE is obtained from the phoneme recognizer and is utilized to extract speaker-independent, dense and rich spoken content representations from spectral features. Then a multi-speaker location-relative attention based seq2seq synthesis model is trained to reconstruct spectral features from the bottle-neck features, conditioning on speaker representations for speaker identity control in the generated speech. To mitigate the…

Code & Models

Repositories

liusongxiang/ppg-vc
pytorch

Taxonomy

Methods: Feature Selection · Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence