FabasedVC: Enhancing Voice Conversion with Text Modality Fusion and Phoneme-Level SSL Features

Wenyu Wang; Zhetao Hu; Yiquan Zhou; Jiacheng Xu; Zhiyu Wu; Chen Li; Shihao Li

arXiv:2511.10112 · cs.SD · November 14, 2025

Open Access

TL;DR

FabasedVC is a voice conversion system that fuses text-modality features with phoneme-level SSL features to improve speaker similarity, prosody, and content preservation. It outperforms competing systems in naturalness and similarity.

Contribution

It introduces a multi-modal fusion approach combining textual features and phoneme-level SSL features within an end-to-end VC system, enhancing conversion quality.

Findings

01 Outperforms competing systems in naturalness and similarity.

02 Improves content integrity in voice conversion.

03 Effectively aligns speech rate and prosody.

Abstract

In voice conversion (VC), it is crucial to preserve complete semantic information while accurately modeling the target speaker's timbre and prosody. This paper proposes FabasedVC to achieve VC with enhanced similarity in timbre, prosody, and duration to the target speaker, as well as improved content integrity. It is an end-to-end VITS-based VC system that integrates relevant textual modality information, phoneme-level self-supervised learning (SSL) features, and a duration predictor. Specifically, we employ a text feature encoder to encode attributes such as text, phonemes, tones, and BERT features. We then process the frame-level SSL features into phoneme-level features using two methods: average pooling and an attention mechanism based on each phoneme's duration. Moreover, a duration predictor is incorporated to better align the speech rate and prosody of the target speaker. Experimental…
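
The abstract describes reducing frame-level SSL features to phoneme-level features by average pooling over each phoneme's duration. A minimal sketch of that idea follows, using plain Python lists. The function name `pool_frames_to_phonemes`, the input layout (one feature vector per frame, one integer frame count per phoneme), and the toy values are all assumptions made for illustration. They are not taken from the paper, and the attention-based variant is not shown.

```python
from typing import List, Sequence


def pool_frames_to_phonemes(
    frames: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Average frame-level vectors over each phoneme's span of frames.

    Hypothetical helper: `frames` holds one feature vector per frame and
    `durations` holds the number of frames assigned to each phoneme.
    """
    if sum(durations) != len(frames):
        raise ValueError("durations must sum to the number of frames")

    pooled: List[List[float]] = []
    start = 0
    for dur in durations:
        if dur <= 0:
            raise ValueError("each phoneme needs at least one frame")
        segment = frames[start:start + dur]
        dim = len(segment[0])
        # Mean of every feature dimension across the phoneme's frames.
        pooled.append([sum(f[d] for f in segment) / dur for d in range(dim)])
        start += dur
    return pooled


# Toy input (assumed): 5 frames of 2-dimensional features, 2 phonemes.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
durations = [2, 3]
phoneme_feats = pool_frames_to_phonemes(frames, durations)
print(phoneme_feats)  # [[2.0, 3.0], [7.0, 8.0]]
```

The output has one vector per phoneme, so its length matches a phoneme-level text encoding of the same utterance. That is presumably what makes fusion with the text features straightforward.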

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Phonetics and Phonology Research