Delivering Speaking Style in Low-resource Voice Conversion with Multi-factor Constraints

Zhichao Wang; Xinsheng Wang; Lei Xie; Yuanzhe Chen; Qiao Tian; Yuping Wang

arXiv:2211.08857 · eess.AS · March 15, 2023

Open Access

TL;DR

This paper introduces MFC-StyleVC, a novel low-resource voice conversion model that effectively preserves speaking style and target speaker timbre using multi-factor constraints and a simulation mode, outperforming existing methods.

Contribution

The paper proposes a new VC model with a clustering-based timbre constraint, perceptual regularization, and a simulation mode to improve low-resource voice conversion performance.

Findings

01. Outperforms existing low-resource VC methods in expressive speech tasks.

02. Effectively maintains speaking style, content, and quality with limited data.

03. Demonstrates robustness through extensive experiments.

Abstract

Conveying the linguistic content and maintaining the source speech's speaking style, such as intonation and emotion, are essential in voice conversion (VC). However, in a low-resource situation, where only a limited number of utterances from the target speaker are accessible, existing VC methods struggle to meet this requirement while also capturing the target speaker's timbre. In this work, a novel VC model, referred to as MFC-StyleVC, is proposed for the low-resource VC task. Specifically, a speaker timbre constraint generated by a clustering method is proposed to guide target speaker timbre learning in different stages. Meanwhile, to prevent over-fitting to the target speaker's limited data, perceptual regularization constraints explicitly maintain model performance on specific aspects, including speaking style, linguistic content, and speech quality. In addition, a simulation mode is introduced to simulate…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques