Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice Conversion

Seymanur Akti; Tuan Nam Nguyen; Alexander Waibel

arXiv:2506.04013 · cs.SD · June 5, 2025

Open Access

TL;DR

This paper presents a non-autoregressive voice conversion model that improves disentanglement of style and content, reduces source leakage, and enhances expressive style transfer using novel representations and training strategies.

Contribution

The paper introduces a content representation based on multilingual discrete speech units and an augmentation-based similarity loss, both aimed at improving style-content disentanglement in non-autoregressive voice conversion.

Findings

1. Outperforms the baselines in emotion similarity
2. Achieves better speaker similarity than the baselines
3. Reduces source style leakage

Abstract

Expressive voice conversion aims to transfer both speaker identity and expressive attributes from a target speech to a given source speech. In this work, we improve over a self-supervised, non-autoregressive framework with a conditional variational autoencoder, focusing on reducing source timbre leakage and improving linguistic-acoustic disentanglement for better style transfer. To minimize style leakage, we use multilingual discrete speech units for content representation and reinforce embeddings with augmentation-based similarity loss and mix-style layer normalization. To enhance expressivity transfer, we incorporate local F0 information via cross-attention and extract style embeddings enriched with global pitch and energy features. Experiments show our model outperforms baselines in emotion and speaker similarity, demonstrating superior style adaptation and reduced source style…
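
As a rough illustration of the augmentation-based similarity loss mentioned in the abstract, the sketch below penalizes the cosine distance between an embedding of an utterance and an embedding of an augmented copy of the same utterance. The abstract does not specify which embeddings are compared, which augmentations are applied, or the exact form of the loss, so the cosine formulation and the function names `cosine_similarity` and `augmentation_similarity_loss` are assumptions made for illustration only, not the authors' implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Small epsilon guards against division by zero for all-zero vectors.
    return dot / (norm_a * norm_b + 1e-8)


def augmentation_similarity_loss(original_embs, augmented_embs):
    """Mean (1 - cosine similarity) over paired embeddings.

    original_embs:  embeddings of the unmodified utterances (assumed input)
    augmented_embs: embeddings of the same utterances after augmentation
                    (the augmentation itself is hypothetical here)
    """
    losses = [
        1.0 - cosine_similarity(orig, aug)
        for orig, aug in zip(original_embs, augmented_embs)
    ]
    return sum(losses) / len(losses)


# Toy example: the first pair points in the same direction, the second is orthogonal.
original = [[1.0, 0.0], [1.0, 0.0]]
augmented = [[2.0, 0.0], [0.0, 3.0]]
loss = augmentation_similarity_loss(original, augmented)
```

Under this assumed formulation, the loss approaches zero when the augmentation leaves the embedding direction unchanged, which is the sense in which such a term could encourage embeddings to ignore the perturbed attributes.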

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Speech and Audio Processing