Residual Speaker Representation for One-Shot Voice Conversion

Le Xu; Jiangyan Yi; Tao Wang; Yong Ren; Rongxiu Zhong; Zhengqi Wen; Jianhua Tao

arXiv:2309.08166 · cs.SD · August 13, 2024

Open Access

TL;DR

This paper introduces a residual speaker module that improves robustness to unseen speakers and enhances timbre control in voice conversion, achieving superior performance over existing methods.

Contribution

It proposes a novel residual speaker representation using multi-layer residual tokens, addressing robustness and timbre control challenges in voice conversion.

Findings

01 Outperforms baseline methods in subjective and objective evaluations

02 Demonstrates increased robustness to unseen speakers

03 Enables effective timbre control in voice conversion

Abstract

Voice conversion has recently advanced significantly and now achieves high-quality results. However, two critical challenges remain. First, current voice conversion methods have limited robustness to unseen speakers. Second, they have limited ability to control the timbre representation. To address these challenges, this paper presents the residual speaker module, a novel approach that leverages tokens of multi-layer residual approximations to enhance robustness to unseen speakers. Introducing multi-layer approximations facilitates the separation of information from the timbre, enabling effective control over timbre in voice conversion. The proposed method outperforms baselines in subjective and objective evaluations, demonstrating superior performance and increased robustness. Our demo page is publicly…
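
The abstract does not spell out how the multi-layer residual approximation is computed, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that each layer owns a small bank of token vectors, attends over them using the current residual as the query, and hands whatever remains unexplained to the next layer. The function names (`residual_approximation`, `make_token_banks`), the softmax attention, the random token banks, and all sizes are hypothetical choices made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def residual_approximation(embedding, token_banks):
    """Approximate `embedding` layer by layer (illustrative assumption only).

    Each layer forms an attention-weighted mix of its own tokens, using the
    current residual as the query, then subtracts that mix from the residual.
    Returns the per-layer approximations and the final residual.
    """
    residual = list(embedding)
    layer_outputs = []
    for tokens in token_banks:
        weights = softmax([dot(residual, t) for t in tokens])
        approx = [
            sum(w * t[d] for w, t in zip(weights, tokens))
            for d in range(len(residual))
        ]
        layer_outputs.append(approx)
        residual = [r - a for r, a in zip(residual, approx)]
    return layer_outputs, residual


def make_token_banks(num_layers, num_tokens, dim, seed=0):
    """Hypothetical stand-in for learned tokens: random Gaussian vectors."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]
        for _ in range(num_layers)
    ]


banks = make_token_banks(num_layers=3, num_tokens=8, dim=4)
speaker_embedding = [0.5, -1.0, 0.25, 2.0]  # made-up reference embedding
layer_outputs, residual = residual_approximation(speaker_embedding, banks)

# The speaker representation in this sketch is the sum of all layer outputs.
speaker_repr = [sum(vals) for vals in zip(*layer_outputs)]
```

By construction, the summed layer outputs plus the final residual reproduce the input embedding exactly. In a sketch like this, one hypothetical way to vary the resulting representation would be to rescale or drop the contribution of individual layers before summing; whether the paper controls timbre this way is not stated in the abstract.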

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing