Towards Cross-speaker Reading Style Transfer on Audiobook Dataset
Xiang Li, Changhe Song, Xianhao Wei, Zhiyong Wu, Jia Jia, Helen Meng

TL;DR
This paper presents a novel multi-scale style transfer model for audiobooks that captures both global genre and local prosody, enabling effective cross-speaker reading style transfer without utterance-level style labels.
Contribution
It introduces a chunk-wise multi-scale style model with switchable adversarial classifiers to disentangle speaker timbre and style for improved cross-speaker style transfer in audiobooks.
Findings
Successfully transfers reading style to new speakers.
Effectively captures both local prosody and global genre.
Enhances multi-speaker audiobook generation.
Abstract
Cross-speaker style transfer aims to extract the speech style of the given reference speech, which can be reproduced in the timbre of arbitrary target speakers. Existing methods on this topic have explored utilizing utterance-level style labels to perform style transfer via either global or local scale style representations. However, audiobook datasets are typically characterized by both the local prosody and global genre, and are rarely accompanied by utterance-level style labels. Thus, properly transferring the reading style across different speakers remains a challenging task. This paper aims to introduce a chunk-wise multi-scale cross-speaker style model to capture both the global genre and the local prosody in audiobook speeches. Moreover, by disentangling speaker timbre and style with the proposed switchable adversarial classifiers, the extracted reading style is made adaptable to…
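The two scales named in the abstract can be pictured with a toy sketch. This is not the paper's model. It assumes a chunk is a group of consecutive utterances, that each utterance already has a local style vector, and it uses plain mean pooling as a stand-in for a learned global encoder. The names `chunk` and `multi_scale_style` are hypothetical.

```python
from statistics import fmean


def chunk(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def multi_scale_style(utterance_vectors, chunk_size):
    """Pair each utterance's local vector with a chunk-level global vector.

    Assumption: mean pooling over the chunk stands in for a learned
    global (genre-level) encoder; the local vector is passed through as-is.
    """
    out = []
    for group in chunk(utterance_vectors, chunk_size):
        dims = len(group[0])
        # Global scale: one vector shared by every utterance in the chunk.
        global_vec = [fmean(v[d] for v in group) for d in range(dims)]
        for local_vec in group:
            # Local scale: the utterance's own vector.
            out.append({"local": local_vec, "global": global_vec})
    return out


# Three utterances with 2-dimensional toy style vectors, chunks of two.
styles = multi_scale_style([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]], chunk_size=2)
```

In this sketch the first two utterances share one global vector and the third, alone in its chunk, gets its own. A real system would learn both representations from audio and would also need the speaker and style disentanglement that the abstract attributes to the adversarial classifiers, which is not modeled here.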
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
