Residual Speaker Representation for One-Shot Voice Conversion
Le Xu, Jiangyan Yi, Tao Wang, Yong Ren, Rongxiu Zhong, Zhengqi Wen, Jianhua Tao

TL;DR
This paper introduces a residual speaker module that improves robustness to unseen speakers and enhances timbre control in voice conversion, achieving superior performance over existing methods.
Contribution
It proposes a novel residual speaker representation using multi-layer residual tokens, addressing robustness and timbre control challenges in voice conversion.
Findings
Outperforms baseline methods in both subjective and objective evaluations
Demonstrates increased robustness to unseen speakers
Enables effective timbre control in voice conversion
Abstract
Recently, there have been significant advancements in voice conversion, resulting in high-quality performance. However, there are still two critical challenges in this field. Firstly, current voice conversion methods have limited robustness when encountering unseen speakers. Secondly, they also have limited ability to control timbre representation. To address these challenges, this paper presents a novel approach that leverages tokens of multi-layer residual approximations to enhance robustness when dealing with unseen speakers, called the residual speaker module. Introducing multi-layer approximations facilitates the separation of information from the timbre, enabling effective control over timbre in voice conversion. The proposed method outperforms baselines in subjective and objective evaluations, demonstrating superior performance and increased robustness. Our demo page is publicly…
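The idea of approximating a speaker representation with several layers of residual tokens can be illustrated with a small sketch. This is not the authors' implementation: the function name `residual_speaker_approximation`, the attention-style weighting over token banks, and all dimensions are assumptions chosen for illustration, and the token banks are random rather than learned. Each layer approximates whatever the previous layers left unexplained.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D score vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def residual_speaker_approximation(speaker_vec, token_banks):
    """Hypothetical sketch: approximate a speaker vector layer by layer.

    Each layer attends over its own bank of tokens (assumed learnable in a
    real model), outputs a weighted sum of them, and passes the remaining
    residual on to the next layer.
    """
    residual = speaker_vec.astype(float).copy()
    approximation = np.zeros_like(residual)
    layer_weights = []
    for tokens in token_banks:
        # Scaled dot-product scores between the residual and each token.
        scores = tokens @ residual / np.sqrt(residual.shape[0])
        weights = softmax(scores)
        layer_out = weights @ tokens
        approximation += layer_out
        residual = residual - layer_out
        layer_weights.append(weights)
    return approximation, residual, layer_weights

# Toy example with random data (illustrative sizes only).
rng = np.random.default_rng(0)
dim, n_tokens, n_layers = 8, 4, 3
token_banks = [rng.normal(size=(n_tokens, dim)) for _ in range(n_layers)]
speaker_vec = rng.normal(size=dim)
approximation, residual, layer_weights = residual_speaker_approximation(
    speaker_vec, token_banks
)
```

Under these assumptions, the per-layer weights give a compact, token-based description of the speaker, and adjusting the weights of one layer would be one conceivable way to steer timbre; whether the paper controls timbre this way is not stated in the abstract.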
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
