Expressive-VC: Highly Expressive Voice Conversion with Attention Fusion of Bottleneck and Perturbation Features

Ziqian Ning; Qicong Xie; Pengcheng Zhu; Zhichao Wang; Liumeng Xue; Jixun Yao; Lei Xie; Mengxiao Bi

arXiv:2211.04710 · eess.AS · November 10, 2022 · 1 citation

TL;DR

Expressive-VC is a novel end-to-end voice conversion framework that combines neural bottleneck and perturbation features with attention fusion, achieving high expressiveness, speaker similarity, and intelligibility in converted speech.

Contribution

It introduces a new fusion approach that uses an attention mechanism to combine linguistic and para-linguistic features for expressive voice conversion.

Findings

1. Outperforms state-of-the-art systems in expressiveness and speaker similarity.
2. Maintains high intelligibility in converted speech.
3. Effectively captures source expressiveness while preserving target speaker identity.

Abstract

Voice conversion for highly expressive speech is challenging. Current approaches struggle to balance speaker similarity, intelligibility, and expressiveness. To address this problem, we propose Expressive-VC, a novel end-to-end voice conversion framework that leverages the advantages of both the neural bottleneck feature (BNF) approach and the information perturbation approach. Specifically, we use a BNF encoder and a Perturbed-Wav encoder to form a content extractor that learns linguistic and para-linguistic features respectively, where the BNFs come from a robust pre-trained ASR model and the perturbed waveform becomes speaker-irrelevant after signal perturbation. We further fuse the linguistic and para-linguistic features through an attention mechanism, where speaker-dependent prosody features are adopted as the attention query, which result from a prosody encoder with target speaker…
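The fusion step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes scaled dot-product attention, applied frame by frame, in which a prosody vector acts as the query and the linguistic and para-linguistic vectors for that frame serve as both keys and values. The function names `dot` and `fuse_frame`, the shared feature dimension, and the scoring function are all hypothetical choices made for illustration.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_frame(
    query: Vector, linguistic: Vector, paralinguistic: Vector
) -> Tuple[Vector, Tuple[float, float]]:
    """Fuse two feature vectors for one frame (illustrative assumption only).

    The prosody `query` scores each stream with a scaled dot product,
    a softmax turns the two scores into weights, and the output is the
    weighted sum of the two streams.
    """
    scale = math.sqrt(len(query))
    score_l = dot(query, linguistic) / scale
    score_p = dot(query, paralinguistic) / scale

    # Numerically stable softmax over the two scores.
    peak = max(score_l, score_p)
    exp_l = math.exp(score_l - peak)
    exp_p = math.exp(score_p - peak)
    total = exp_l + exp_p
    w_l, w_p = exp_l / total, exp_p / total

    fused = [w_l * a + w_p * b for a, b in zip(linguistic, paralinguistic)]
    return fused, (w_l, w_p)


if __name__ == "__main__":
    # Toy 2-dimensional frame: the query aligns with the linguistic vector,
    # so that stream receives the larger weight.
    fused, weights = fuse_frame([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
    print(fused, weights)
```

In this sketch the weights are recomputed for every frame, so the balance between the two streams can shift over time. How the actual model computes its attention is not specified in the truncated abstract above.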

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders