Pureformer-VC: Non-parallel One-Shot Voice Conversion with Pure Transformer Blocks and Triplet Discriminative Training

Wenhan Yao; Zedong Xing; Xiarun Chen; Jia Liu; Yongqiang He; Weiping Wen

arXiv:2409.01668 · cs.SD · November 26, 2024

Open Access

TL;DR

Pureformer-VC is a non-parallel, one-shot voice conversion framework built from pure Transformer blocks and trained with a triplet discriminative loss; it improves objective metrics and achieves subjective quality comparable to existing methods.

Contribution

The paper proposes a voice conversion model with a purely Transformer-based architecture and a triplet loss, improving disentanglement and style transfer without parallel data.

Findings

01. Achieves comparable subjective voice quality to existing methods.

02. Shows improvements in objective speech conversion metrics.

03. Effective disentanglement of speaker characteristics.

Abstract

One-shot voice conversion (VC) aims to change the timbre of any source speech to match that of a target speaker using only one speech sample. Existing style transfer-based VC methods rely on speech representation disentanglement, and they struggle to encode each speech component accurately and independently and to recompose the components into converted speech. To tackle this, we propose Pureformer-VC, which uses Conformer blocks to build a disentangled encoder and Zipformer blocks to build a style transfer decoder as the generator. In the decoder, styleformer blocks integrate speaker characteristics into the generated speech. The model uses a generative VAE loss for encoding components and a triplet loss for unsupervised discriminative training. We apply the styleformer method to Zipformer's shared weights for style transfer. The experimental…
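The two training objectives named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption: the Euclidean distance, the margin value, the standard-normal prior in the KL term, and the function names `triplet_loss` and `gaussian_kl` are all hypothetical and chosen only to show the textbook forms of a triplet loss over speaker embeddings and a VAE regulariser.

```python
import math


def euclidean(a, b):
    """L2 distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Textbook triplet loss: max(0, d(a, p) - d(a, n) + margin).

    Assumed setup: anchor and positive are embeddings of utterances from
    the same speaker, negative is from a different speaker. The margin
    value is illustrative, not taken from the paper.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), the usual VAE term.

    Assumes a diagonal Gaussian posterior and a standard-normal prior,
    which the abstract does not confirm.
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


if __name__ == "__main__":
    same_a, same_b = [0.0, 0.0], [0.0, 1.0]   # same speaker, close together
    other = [3.0, 4.0]                        # different speaker, far away
    print(triplet_loss(same_a, same_b, other))  # well separated: no penalty
    print(triplet_loss(same_a, other, same_b))  # roles swapped: penalised
    print(gaussian_kl([0.0, 0.0], [0.0, 0.0]))  # posterior equals the prior
```

The triplet term is zero once the different-speaker embedding is at least `margin` farther from the anchor than the same-speaker embedding, which is the sense in which it pushes speaker representations apart without requiring speaker labels beyond the triplet grouping.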

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Triplet Loss