FabasedVC: Enhancing Voice Conversion with Text Modality Fusion and Phoneme-Level SSL Features
Wenyu Wang, Zhetao Hu, Yiquan Zhou, Jiacheng Xu, Zhiyu Wu, Chen Li, Shihao Li

TL;DR
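FabasedVC is an end-to-end, VITS-based voice conversion system that fuses text-modality features with phoneme-level SSL features and adds a duration predictor to improve speaker similarity, prosody, and content preservation. The authors report that it outperforms competing systems in naturalness and similarity.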
Contribution
The paper introduces a multimodal fusion approach that combines textual features (text, phonemes, tones, and BERT features) with phoneme-level SSL features inside an end-to-end VC system.
Findings
Outperforms competing systems in naturalness and similarity.
Improves content integrity in voice conversion.
Aligns speech rate and prosody with those of the target speaker by means of a duration predictor.
Abstract
In voice conversion (VC), it is crucial to preserve complete semantic information while accurately modeling the target speaker's timbre and prosody. This paper proposes FabasedVC to achieve VC with enhanced similarity in timbre, prosody, and duration to the target speaker, as well as improved content integrity. It is an end-to-end VITS-based VC system that integrates relevant textual modality information, phoneme-level self-supervised learning (SSL) features, and a duration predictor. Specifically, we employ a text feature encoder to encode attributes such as text, phonemes, tones, and BERT features. We then process the frame-level SSL features into phoneme-level features using two methods: average pooling and an attention mechanism, both based on each phoneme's duration. Moreover, a duration predictor is incorporated to better align the speech rate and prosody with those of the target speaker. Experimental…
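To make the frame-to-phoneme step described in the abstract concrete, here is a minimal NumPy sketch of both pooling variants. It is an illustration under stated assumptions, not the authors' implementation: the function names `phoneme_average_pool` and `phoneme_attention_pool` are hypothetical, the durations are assumed to be per-phoneme frame counts that sum to the number of frames, and the single `query` vector in the attention variant is a stand-in for whatever learned attention the paper actually uses, which the abstract does not specify.

```python
import numpy as np


def phoneme_average_pool(frames, durations):
    """Average frame-level features over each phoneme's span.

    frames: (T, D) array of frame-level SSL features.
    durations: per-phoneme frame counts (assumed to sum to T).
    Returns a (P, D) array with one vector per phoneme.
    """
    frames = np.asarray(frames, dtype=float)
    durations = np.asarray(durations, dtype=int)
    if (durations <= 0).any():
        raise ValueError("each phoneme needs at least one frame")
    if durations.sum() != frames.shape[0]:
        raise ValueError("durations must sum to the number of frames")
    # Start index of each phoneme's segment along the time axis.
    starts = np.concatenate(([0], np.cumsum(durations)[:-1]))
    sums = np.add.reduceat(frames, starts, axis=0)
    return sums / durations[:, None]


def phoneme_attention_pool(frames, durations, query):
    """Attention-weighted pooling inside each phoneme's span.

    query: (D,) vector; hypothetical stand-in for a learned query.
    Weights are a softmax over scaled dot-product scores, computed
    separately within each phoneme segment.
    """
    frames = np.asarray(frames, dtype=float)
    query = np.asarray(query, dtype=float)
    bounds = np.concatenate(([0], np.cumsum(durations)))
    pooled = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        segment = frames[start:end]
        scores = segment @ query / np.sqrt(frames.shape[1])
        weights = np.exp(scores - scores.max())  # stable softmax
        weights /= weights.sum()
        pooled.append(weights @ segment)
    return np.stack(pooled)


# Toy input: 6 frames of 2-dim features, 3 phonemes lasting 2, 1, 3 frames.
frames = np.arange(12, dtype=float).reshape(6, 2)
durations = [2, 1, 3]
avg = phoneme_average_pool(frames, durations)
att = phoneme_attention_pool(frames, durations, query=np.zeros(2))
```

With an all-zero query the attention weights are uniform, so the attention variant reduces to average pooling; a non-zero query lets some frames within a phoneme contribute more than others.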
Taxonomy
Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Phonetics and Phonology Research
