MSM-VC: High-fidelity Source Style Transfer for Non-Parallel Voice Conversion by Multi-scale Style Modeling

Zhichao Wang; Xinsheng Wang; Qicong Xie; Tao Li; Lei Xie; Qiao Tian; Yuping Wang

arXiv:2309.01142·eess.AS·September 6, 2023

Open Access

TL;DR

This paper introduces MSM-VC, a multi-scale style modeling approach for voice conversion that captures comprehensive speaking styles at different levels while preserving target speaker identity, improving style transfer quality.

Contribution

The paper proposes a novel multi-scale style modeling method for VC, utilizing diverse features at different levels and an explicit constraint module to enhance style transfer and speaker preservation.

Findings

01. MSM-VC outperforms state-of-the-art methods in style modeling accuracy.

02. It maintains high speech quality and speaker similarity.

03. The approach effectively captures expressive speaking styles.

Abstract

In addition to conveying the linguistic content of the source speech to the converted speech, maintaining the speaking style of the source speech plays an important role in the voice conversion (VC) task and is essential in many scenarios with highly expressive source speech, such as dubbing and data augmentation. Previous work generally used explicit prosodic features or a fixed-length style embedding extracted from the source speech to model its speaking style, which is insufficient for comprehensive style modeling and target speaker timbre preservation. Inspired by the multi-scale nature of style in human speech, this paper proposes a multi-scale style modeling method for the VC task, referred to as MSM-VC. MSM-VC models the speaking style of the source speech at different levels. To effectively convey the speaking style and meanwhile prevent timbre leakage from…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders