Learning Expressive Disentangled Speech Representations with Soft Speech Units and Adversarial Style Augmentation

Yimin Deng; Jianzong Wang; Xulong Zhang; Ning Cheng; Jing Xiao

arXiv:2405.00603 · cs.SD · May 2, 2024

TL;DR

This paper introduces SAVC, a novel expressive voice conversion framework that uses soft speech units and adversarial style augmentation to improve content preservation and prosody modeling, resulting in more natural and intelligible converted speech.

Contribution

The paper proposes a new voice conversion method utilizing soft speech units and adversarial style augmentation to better disentangle speaker, content, and prosody features.

Findings

01 Improved speech naturalness and intelligibility over previous methods.

02 Effective elimination of speaker information through adversarial style augmentation.

03 Implicit prosody modeling enhances expressive voice conversion.

Abstract

Voice conversion is the task of transforming the voice characteristics of source speech while preserving its content. Self-supervised representation learning models are increasingly used for content extraction. However, these representations retain a large amount of hidden speaker information, which leads to timbre leakage, while the prosodic information in the hidden units goes largely unused. To address these issues, we propose a novel framework for expressive voice conversion called "SAVC", based on soft speech units from HuBert-soft. Taking soft speech units as input, we design an attribute encoder to extract content and prosody features respectively. Specifically, we first introduce statistic perturbation imposed by adversarial style augmentation to eliminate speaker information. The prosody is then implicitly modeled on the soft speech units with knowledge distillation. Experiment results show that the…
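As a rough illustration of what "statistic perturbation" of soft speech units could look like, the sketch below normalizes each feature channel over time and then restores it with a perturbed mean and standard deviation. This is not the authors' implementation: the function name `perturb_statistics`, the `strength` parameter, the feature shape, and the use of random Gaussian noise in place of an adversarially learned perturbation are all assumptions made for the example.

```python
import numpy as np


def perturb_statistics(units, rng, strength=0.1, eps=1e-5):
    """Perturb per-channel statistics of a (frames, channels) feature matrix.

    Hypothetical helper: random noise stands in for the adversarially
    learned perturbation described in the abstract.
    """
    # Channel-wise statistics over the time axis.
    mean = units.mean(axis=0, keepdims=True)
    std = units.std(axis=0, keepdims=True) + eps

    # Remove the original statistics.
    normalized = (units - mean) / std

    # Perturb the statistics (assumed noise model, not from the paper).
    new_mean = mean + strength * std * rng.standard_normal(mean.shape)
    new_std = std * (1.0 + strength * rng.standard_normal(std.shape))

    # Restore the features with the perturbed statistics.
    return normalized * new_std + new_mean


rng = np.random.default_rng(0)
units = rng.standard_normal((50, 256))  # 50 frames, 256 channels (assumed shape)
augmented = perturb_statistics(units, rng)
unchanged = perturb_statistics(units, rng, strength=0.0)
```

With `strength=0.0` the features come back unchanged, and larger values shift the channel statistics further while leaving the frame-to-frame pattern within each channel intact.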

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing