Towards Cross-speaker Reading Style Transfer on Audiobook Dataset

Xiang Li; Changhe Song; Xianhao Wei; Zhiyong Wu; Jia Jia; Helen Meng

arXiv:2208.05359 · cs.SD · August 22, 2022

TL;DR

This paper presents a multi-scale style transfer model for audiobooks that captures both global genre and local prosody, enabling cross-speaker reading style transfer without utterance-level style labels.

Contribution

The paper introduces a chunk-wise multi-scale style model with switchable adversarial classifiers that disentangle speaker timbre from style, so that a reading style can be transferred across speakers in audiobooks.

Findings

01. Successfully transfers reading style to new speakers.

02. Effectively captures both local prosody and global genre.

03. Enhances multi-speaker audiobook generation.

Abstract

Cross-speaker style transfer aims to extract the speech style of a given reference utterance so that it can be reproduced in the timbre of arbitrary target speakers. Existing methods use utterance-level style labels to perform style transfer via either global-scale or local-scale style representations. However, audiobook datasets are typically characterized by both local prosody and global genre, and are rarely accompanied by utterance-level style labels. Properly transferring the reading style across different speakers therefore remains a challenging task. This paper introduces a chunk-wise multi-scale cross-speaker style model to capture both the global genre and the local prosody in audiobook speech. Moreover, by disentangling speaker timbre and style with the proposed switchable adversarial classifiers, the extracted reading style is made adaptable to…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing