Multi-level Temporal-channel Speaker Retrieval for Zero-shot Voice Conversion

Zhichao Wang; Liumeng Xue; Qiuqiang Kong; Lei Xie; Yuanzhe Chen; Qiao Tian; Yuping Wang

arXiv:2305.07204 · eess.AS · May 21, 2024 · 1 cites

Open Access

TL;DR

This paper introduces MTCR-VC, a zero-shot voice conversion model that employs multi-level temporal-channel speaker retrieval to better model unseen speakers by capturing speaker information across temporal and frequency dimensions.

Contribution

The paper proposes a novel temporal-channel retrieval method and a multi-granularity speaker modeling framework for improved zero-shot voice conversion.

Findings

01 Outperforms previous methods in modeling speaker timbre.
02 Maintains high speech naturalness.
03 Effective in zero-shot scenarios.

Abstract

Zero-shot voice conversion (VC) converts source speech into the voice of any desired speaker using only one utterance of the speaker without requiring additional model updates. Typical methods use a speaker representation from a pre-trained speaker verification (SV) model or learn speaker representation during VC training to achieve zero-shot VC. However, existing speaker modeling methods overlook the variation of speaker information richness in temporal and frequency channel dimensions of speech. This insufficient speaker modeling hampers the ability of the VC model to accurately represent unseen speakers who are not in the training dataset. In this study, we present a robust zero-shot VC model with multi-level temporal-channel retrieval, referred to as MTCR-VC. Specifically, to flexibly adapt to the dynamic-variant speaker characteristic in the temporal and channel axis of the speech,…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing