Pureformer-VC: Non-parallel Voice Conversion with Pure Stylized Transformer Blocks and Triplet Discriminative Training

Wenhan Yao; Fen Xiao; Xiarun Chen; Jia Liu; YongQiang He; Weiping Wen

arXiv:2506.08348 · cs.SD · June 11, 2025

TL;DR

Pureformer-VC introduces a novel non-parallel voice conversion framework using specialized transformer blocks and triplet discriminative training, achieving improved objective metrics while maintaining naturalness in speech conversion.

Contribution

It presents a new encoder-decoder model with disentangled speech encoding and style transfer mechanisms, advancing non-parallel voice conversion technology.

Findings

01 Achieves comparable subjective quality to existing methods.
02 Significantly improves objective evaluation metrics.
03 Effective in many-to-many and many-to-one VC scenarios.

Abstract

As a foundational technology for intelligent human-computer interaction, voice conversion (VC) seeks to transform speech from any source timbre into any target timbre. Traditional voice conversion methods based on Generative Adversarial Networks (GANs) struggle to encode the distinct components of speech precisely and to synthesize those components into natural-sounding converted speech. To overcome these limitations, we introduce Pureformer-VC, an encoder-decoder framework that uses Conformer blocks to build a disentangled encoder and Zipformer blocks to build a style transfer decoder. We adopt a variational decoupled training approach that isolates speech components with a Variational Autoencoder (VAE), complemented by triplet discriminative training that makes the speaker representations more discriminative. Furthermore, we incorporate the Attention…
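The abstract names triplet discriminative training without giving its formula on this page. As a rough illustration only, the sketch below shows the standard triplet margin loss on toy speaker embeddings, assuming Euclidean distance and a margin of 1.0. The function names, the margin, and the embedding values are hypothetical and are not taken from the paper, whose actual loss may differ.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


# Toy 2-D "speaker embeddings" (made-up values, for illustration only).
anchor = [0.0, 0.0]    # utterance from speaker A
positive = [3.0, 4.0]  # another utterance from speaker A, distance 5
negative = [6.0, 8.0]  # utterance from speaker B, distance 10

# Same-speaker pair is already closer by more than the margin: no penalty.
loss_satisfied = triplet_margin_loss(anchor, positive, negative)

# Swapping the roles puts the "positive" farther away than the "negative",
# so the loss is positive and would push the embeddings apart in training.
loss_violated = triplet_margin_loss(anchor, negative, positive)

print(loss_satisfied, loss_violated)
```

In a real system the embeddings would come from a speaker encoder and the loss would be averaged over a batch of triplets; how Pureformer-VC samples triplets is not stated in the text above.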

Code & Models

Models

🤗 oxhatestrading/voice-synthesis (model)

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Speech and Audio Processing