Improving robustness of one-shot voice conversion with deep discriminative speaker encoder

Hongqiang Du; Lei Xie

arXiv:2106.10406 · cs.SD · June 22, 2021

TL;DR

This paper introduces a deep discriminative speaker encoder that enhances the robustness and quality of one-shot voice conversion for unseen speakers by effectively extracting speaker embeddings from a single utterance.

Contribution

It proposes a novel speaker encoder combining residual, squeeze-and-excitation, and attention mechanisms to improve speaker embedding reliability in one-shot voice conversion.

Findings

01  Improved speaker similarity in voice conversion

02  Enhanced speech quality over baseline systems

03  Better robustness for unseen speakers

Abstract

One-shot voice conversion has received significant attention because it requires only one utterance each from the source speaker and the target speaker. Moreover, neither speaker needs to be seen during training. However, existing one-shot voice conversion approaches are unstable for unseen speakers because the speaker embedding extracted from a single utterance of an unseen speaker is unreliable. In this paper, we propose a deep discriminative speaker encoder to extract the speaker embedding from one utterance more effectively. Specifically, the speaker encoder first integrates a residual network and a squeeze-and-excitation network to extract discriminative speaker information at the frame level by modeling frame-wise and channel-wise interdependence in the features. Then an attention mechanism is introduced to further emphasize speaker-related information via assigning different…
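To make the encoder description more concrete, here is a minimal NumPy sketch of the two ideas the abstract names: squeeze-and-excitation channel gating wrapped in a residual connection, followed by attention-weighted pooling over frames. It is an illustration under stated assumptions, not the authors' implementation. The function names, the reduction ratio `r`, the random weights, and the omission of convolution layers are all hypothetical. Because the abstract is truncated, reading the attention step as a per-frame weighting is also an assumption.

```python
import numpy as np


def squeeze_excitation(x, w1, w2):
    """Reweight channels of x, shaped (channels, frames)."""
    s = x.mean(axis=1)                   # squeeze: average each channel over frames
    h = np.maximum(w1 @ s, 0.0)          # bottleneck projection with ReLU
    g = 1.0 / (1.0 + np.exp(-(w2 @ h)))  # sigmoid gate per channel, in (0, 1)
    return x * g[:, None]                # excitation: scale every frame of a channel


def residual_se_block(x, w1, w2):
    """Identity shortcut plus the SE-gated features (convolutions omitted)."""
    return x + squeeze_excitation(x, w1, w2)


def attentive_pooling(x, v):
    """Collapse frames into one vector using softmax weights over frames."""
    scores = v @ x                       # one scalar score per frame
    scores = scores - scores.max()       # shift for numerical stability
    a = np.exp(scores)
    a = a / a.sum()                      # weights are positive and sum to 1
    return x @ a, a                      # weighted average over frames


def toy_speaker_embedding(x, r=4, seed=0):
    """Hypothetical end-to-end sketch with random (untrained) weights."""
    rng = np.random.default_rng(seed)
    c = x.shape[0]
    w1 = rng.standard_normal((c // r, c)) * 0.1
    w2 = rng.standard_normal((c, c // r)) * 0.1
    v = rng.standard_normal(c) * 0.1
    return attentive_pooling(residual_se_block(x, w1, w2), v)
```

Calling `toy_speaker_embedding` on a `(channels, frames)` feature matrix returns a fixed-length vector with one entry per channel, together with the per-frame weights. The point of the pooling step is that the output length does not depend on how many frames the single input utterance contains.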

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing