Takin-VC: Expressive Zero-Shot Voice Conversion via Adaptive Hybrid Content Encoding and Enhanced Timbre Modeling
Yuguang Yang, Yu Pan, Jixun Yao, Xiang Zhang, Jianhao Ye, Hongbin Zhou, Lei Xie, Lei Ma, Jianjun Zhao

TL;DR
Takin-VC is a novel zero-shot voice conversion framework that significantly improves speech naturalness, expressiveness, and speaker similarity by using adaptive hybrid encoding and memory-augmented timbre modeling.
Contribution
Takin-VC introduces a hybrid content encoder with adaptive fusion and memory-augmented timbre modeling, advancing zero-shot VC capabilities for expressive speech.
Findings
Outperforms state-of-the-art VC systems in naturalness and expressiveness
Achieves higher speaker similarity in zero-shot scenarios
Offers improved inference speed for real-time applications
Abstract
Expressive zero-shot voice conversion (VC) is a critical and challenging task that aims to transform the source timbre into an arbitrary unseen speaker while preserving the original content and expressive qualities. Despite recent progress in zero-shot VC, there remains considerable potential for improvements in speaker similarity and speech naturalness. Moreover, existing zero-shot VC systems struggle to fully reproduce paralinguistic information in highly expressive speech, such as breathing, crying, and emotional nuances, limiting their practical applicability. To address these issues, we propose Takin-VC, a novel expressive zero-shot VC framework via adaptive hybrid content encoding and memory-augmented context-aware timbre modeling. Specifically, we introduce an innovative hybrid content encoder that incorporates an adaptive fusion module, capable of effectively integrating…
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
Methods: ALIGN
