StarVC: A Unified Auto-Regressive Framework for Joint Text and Speech Generation in Voice Conversion

Fengjin Li; Jie Wang; Yadong Niu; Yongqing Wang; Meng Meng; Jian Luan; Zhiyong Wu

arXiv:2506.02414 · cs.MM · June 4, 2025

TL;DR

StarVC is a unified autoregressive voice conversion framework that predicts text tokens before synthesizing acoustic features. The authors report that this explicit text modeling preserves linguistic content and speaker characteristics better than conventional VC methods.

Contribution

StarVC is described as the first VC framework to integrate explicit text prediction, with the aim of improving content preservation and speaker similarity.

Findings

01

Outperforms traditional VC in content preservation (WER, CER)

02

Achieves higher speaker similarity (SECS, MOS)

03

Supports these results with experimental comparisons against conventional VC methods
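
The content-preservation metrics named above, WER and CER, are both edit-distance ratios: the number of substitutions, insertions, and deletions needed to turn a transcript of the converted speech into the reference text, divided by the reference length. The following is a minimal sketch of that standard definition, not the authors' evaluation code. The function names and the choice to ignore spaces in CER are assumptions made for illustration, and a real evaluation would first obtain the hypothesis text from an ASR system and normalize both strings.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate; spaces are ignored here (an illustrative choice)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

Lower values are better for both metrics, and either can exceed 1.0 when the hypothesis contains many insertions.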

Abstract

Voice Conversion (VC) modifies speech to match a target speaker while preserving linguistic content. Traditional methods usually extract speaker information directly from speech while neglecting the explicit utilization of linguistic content. Since VC fundamentally involves disentangling speaker identity from linguistic content, leveraging structured semantic features could enhance conversion performance. However, previous attempts to incorporate semantic features into VC have shown limited effectiveness, motivating the integration of explicit text modeling. We propose StarVC, a unified autoregressive VC framework that first predicts text tokens before synthesizing acoustic features. The experiments demonstrate that StarVC outperforms conventional VC methods in preserving both linguistic content (i.e., WER and CER) and speaker characteristics (i.e., SECS and MOS). Audio demo can be…
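
SECS (speaker embedding cosine similarity) scores speaker similarity as the cosine of the angle between an embedding of the converted speech and an embedding of the target speaker. The sketch below shows only that final computation, under the assumption that embeddings are already available as plain lists of floats. The abstract does not say which speaker encoder produces them, and the function names here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_secs(converted_embeddings, target_embeddings):
    """Average cosine similarity over paired (converted, target) embeddings."""
    if len(converted_embeddings) != len(target_embeddings):
        raise ValueError("need one target embedding per converted utterance")
    if not converted_embeddings:
        raise ValueError("need at least one embedding pair")
    scores = [cosine_similarity(c, t)
              for c, t in zip(converted_embeddings, target_embeddings)]
    return sum(scores) / len(scores)
```

Scores range from -1 to 1, and higher values indicate that the converted speech sits closer to the target speaker in the encoder's embedding space.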

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling