Style-Label-Free: Cross-Speaker Style Transfer by Quantized VAE and Speaker-wise Normalization in Speech Synthesis

Chunyu Qiang; Peng Yang; Hao Che; Xiaorui Wang; Zhongyuan Wang

arXiv:2212.06397·cs.SD·December 14, 2022

TL;DR

This paper introduces a cross-speaker style transfer method for speech synthesis that requires no style labels. It uses a quantized VAE to extract style and speaker-wise normalization to limit source speaker leakage.

Contribution

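The paper presents a label-free cross-speaker style transfer approach: a reference encoder based on a Q-VAE and a style bottleneck extracts discrete style representations, and a speaker-wise batch normalization layer reduces source speaker leakage, so no annotated style labels are needed.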

Findings

01 Outperforms baseline in style transfer quality

02 Effectively reduces source speaker leakage

03 Enhances style extraction with contrastive data augmentation

Abstract

Cross-speaker style transfer in speech synthesis aims at transferring a style from source speaker to synthesised speech of a target speaker's timbre. Most previous approaches rely on data with style labels, but manually-annotated labels are expensive and not always reliable. In response to this problem, we propose Style-Label-Free, a cross-speaker style transfer method, which can realize the style transfer from source speaker to target speaker without style labels. Firstly, a reference encoder structure based on quantized variational autoencoder (Q-VAE) and style bottleneck is designed to extract discrete style representations. Secondly, a speaker-wise batch normalization layer is proposed to reduce the source speaker leakage. In order to improve the style extraction ability of the reference encoder, a style invariant and contrastive data augmentation method is proposed. Experimental…
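
The two components named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no architectural details, so the sketch assumes that quantization means a nearest-neighbor codebook lookup and that speaker-wise normalization means standardizing each feature dimension with statistics computed only over frames from the same speaker, with no learned scale or shift. The function names `quantize` and `speaker_wise_normalize` and all data below are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def quantize(vector, codebook):
    """Return (index, entry) of the codebook entry nearest to `vector`.

    Assumption: squared Euclidean distance, as in a generic vector-quantized VAE.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]


def speaker_wise_normalize(features, speaker_ids, eps=1e-5):
    """Standardize each feature dimension using same-speaker statistics only.

    Assumption: plain per-speaker mean/variance, no learned affine parameters.
    """
    # Group feature rows by the speaker they belong to.
    groups = defaultdict(list)
    for row, spk in zip(features, speaker_ids):
        groups[spk].append(row)

    # Per-speaker mean and (biased) variance for every dimension.
    stats = {}
    for spk, rows in groups.items():
        n, dims = len(rows), len(rows[0])
        mean = [sum(r[d] for r in rows) / n for d in range(dims)]
        var = [sum((r[d] - mean[d]) ** 2 for r in rows) / n for d in range(dims)]
        stats[spk] = (mean, var)

    # Normalize every row with its own speaker's statistics.
    out = []
    for row, spk in zip(features, speaker_ids):
        mean, var = stats[spk]
        out.append([(x - m) / sqrt(v + eps) for x, m, v in zip(row, mean, var)])
    return out


# Hypothetical usage: a 3-entry codebook and two speakers with different offsets.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
index, entry = quantize([0.9, 1.2], codebook)

feats = [[1.0, 10.0], [3.0, 14.0], [100.0, -5.0], [104.0, -1.0]]
normed = speaker_wise_normalize(feats, ["a", "a", "b", "b"])
```

In this toy data the two speakers differ mainly by a per-speaker offset and scale. After normalization their rows become nearly identical, which is the intuition behind using speaker-specific statistics to strip speaker-dependent information from a style representation.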

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Batch Normalization