SemAlignVC: Enhancing zero-shot timbre conversion using semantic alignment

Shivam Mehta; Yingru Liu; Zhenyu Tang; Kainan Peng; Vimal Manohar; Shun Zhang; Mike Seltzer; Qing He; Mingbo Ma

arXiv:2507.09070·eess.AS·July 15, 2025

TL;DR

SemAlignVC introduces a novel semantic alignment method to improve zero-shot voice conversion by effectively reducing timbre leakage and enhancing speech quality without relying on explicit speaker embeddings.

Contribution

The paper presents SemAlignVC, a new architecture that aligns text and audio representations to disentangle speaker identity from content, enabling high-fidelity zero-shot voice conversion.

Findings

01 Significantly reduces timbre leakage compared to baselines

02 Outperforms baselines in speaker timbre similarity, intelligibility, and naturalness

03 Provides a privacy-preserving and generalizable VC solution

Abstract

Zero-shot voice conversion (VC) synthesizes speech in a target speaker's voice while preserving linguistic and paralinguistic content. However, timbre leakage, where source speaker traits persist, remains a challenge, especially in neural codec and LLM-based VC, where quantized representations entangle speaker identity with content. We introduce SemAlignVC, an architecture designed to prevent timbre leakage using SemAlign, a novel method that aligns text and audio representations to ensure speaker-independent semantic encoding. This disentangled representation conditions an autoregressive transformer for high-fidelity conversion without explicit speaker embeddings. Experiments show SemAlignVC significantly reduces timbre leakage, outperforming baselines in speaker timbre similarity, intelligibility, and naturalness, making it a robust, privacy-preserving, and generalizable VC solution.…
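
The abstract does not say which objective SemAlign uses to align text and audio representations. As a rough illustration of the general idea only, the sketch below assumes paired, equal-length sequences of text and audio embedding frames and scores them with a mean cosine distance. The names `cosine_similarity` and `alignment_loss`, the frame pairing, and the choice of cosine distance are all hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(text_frames, audio_frames):
    """Mean cosine distance (1 - similarity) over paired frames.

    Hypothetical objective: it is low when each audio frame points in the
    same direction as its paired text frame, regardless of magnitude.
    The paper's actual SemAlign loss may differ.
    """
    if len(text_frames) != len(audio_frames):
        raise ValueError("expected one audio frame per text frame")
    distances = [
        1.0 - cosine_similarity(t, a)
        for t, a in zip(text_frames, audio_frames)
    ]
    return sum(distances) / len(distances)


# Toy embeddings (made up for illustration, not real model outputs).
text = [[1.0, 0.0], [0.0, 1.0]]
aligned_audio = [[2.0, 0.0], [0.0, 3.0]]     # same directions as text
misaligned_audio = [[0.0, 1.0], [1.0, 0.0]]  # orthogonal to text

print(alignment_loss(text, aligned_audio))
print(alignment_loss(text, misaligned_audio))
```

In this toy setup the aligned pair scores lower than the misaligned one. The intuition is that text carries no timbre, so pulling an audio encoder's output toward the text representation should discourage it from encoding speaker identity.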

Taxonomy

Topics: Video Analysis and Summarization · Image Enhancement Techniques · Speech and Audio Processing