TL;DR
This paper introduces an any-to-many voice conversion method using a sequence-to-sequence model with location-relative attention, leveraging text supervision and a bottleneck feature extractor for improved naturalness and speaker similarity.
Contribution
It proposes a novel any-to-many voice conversion framework combining a phoneme recognizer, bottleneck features, and a location-relative attention seq2seq model, enabling high-quality non-parallel voice conversion.
Findings
Superior naturalness and speaker similarity in voice conversion results
Effective alignment of long sequences using mixture-of-logistics (MoL) attention and down-sampling
The approach can be extended to any-to-any voice conversion with high performance
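The location-relative alignment mentioned in the findings can be illustrated with a minimal NumPy sketch of one discretized mixture-of-logistics attention step. This is a generic formulation under stated assumptions, not the paper's implementation: the function name `mol_attention_step` and its parameters are hypothetical, and in a real model the mean increments, scales, and mixture weights would be predicted by the decoder at every step rather than passed in by hand.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, which is the CDF of a logistic distribution."""
    return 1.0 / (1.0 + np.exp(-x))


def mol_attention_step(prev_means, delta, scales, weights, num_encoder_steps):
    """One decoder step of a discretized mixture-of-logistics attention.

    Hypothetical helper for illustration only.
    prev_means: (K,) component means from the previous decoder step
    delta:      (K,) non-negative mean increments (assumed already >= 0)
    scales:     (K,) positive component scales
    weights:    (K,) mixture weights that sum to 1
    Returns the alignment over encoder positions and the updated means.
    """
    # Means can only move forward, so the alignment is monotonic and
    # depends on relative position rather than on content.
    means = prev_means + delta
    positions = np.arange(num_encoder_steps, dtype=np.float64)[:, None]  # (T, 1)
    # Probability mass each component assigns to the bin [j - 0.5, j + 0.5].
    upper = sigmoid((positions + 0.5 - means) / scales)
    lower = sigmoid((positions - 0.5 - means) / scales)
    alignment = ((upper - lower) * weights).sum(axis=1)  # (T,)
    return alignment, means
```

Calling the function repeatedly and feeding the returned means back in moves the attention peak forward along the encoder sequence. Because the means never decrease, the alignment cannot jump backwards, which is one reason this family of attention mechanisms is considered robust on long inputs.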
Abstract
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach, which utilizes text supervision during training. In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module. During the training stage, an encoder-decoder-based hybrid connectionist-temporal-classification-attention (CTC-attention) phoneme recognizer is trained, whose encoder has a bottle-neck layer. A BNE is obtained from the phoneme recognizer and is utilized to extract speaker-independent, dense and rich spoken content representations from spectral features. Then a multi-speaker location-relative attention based seq2seq synthesis model is trained to reconstruct spectral features from the bottle-neck features, conditioning on speaker representations for speaker identity control in the generated speech. To mitigate the…
Taxonomy
Methods: Feature Selection · Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence
