Stylebook: Content-Dependent Speaking Style Modeling for Any-to-Any Voice Conversion using Only Speech Data

Hyungseob Lim; Kyungguen Byun; Sunkuk Moon; Erik Visser

arXiv:2309.02730·eess.AS·December 18, 2023·1 cites

Open Access

TL;DR

This paper introduces Stylebook, a novel voice conversion method that uses a self-supervised learning model and style embeddings to accurately transfer speaking styles without text or speaker labels, improving speaker similarity.

Contribution

It presents a new style extraction and transfer approach using stylebook embeddings and attention mechanisms, enhancing style fidelity in any-to-any voice conversion.

Findings

01. Achieves better speaker similarity than baseline models.

02. Effectively captures and transfers rich speaking styles.

03. Maintains computational efficiency with longer utterances.

Abstract

While many recent any-to-any voice conversion models succeed in transferring some target speech's style information to the converted speech, they still lack the ability to faithfully reproduce the speaking style of the target speaker. In this work, we propose a novel method to extract rich style information from target utterances and to efficiently transfer it to source speech content without requiring text transcriptions or speaker labeling. Our proposed approach introduces an attention mechanism utilizing a self-supervised learning (SSL) model to collect the speaking styles of a target speaker each corresponding to the different phonetic content. The styles are represented with a set of embeddings called stylebook. In the next step, the stylebook is attended with the source speech's phonetic content to determine the final target style for each source content. Finally, content…
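
The two attention steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the learnable content codes, the feature dimensions, the use of single-head scaled dot-product attention, and the function names `build_stylebook` and `lookup_style` are all assumptions made for illustration, and random arrays stand in for SSL features. The sketch only shows the data flow: target frames are first summarized into a fixed-size stylebook, and each source frame then reads its style from that stylebook according to its phonetic content.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    # Single-head scaled dot-product attention (an assumed choice).
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    return weights @ values

def build_stylebook(content_codes, target_content, target_style):
    # Step 1 (hypothetical form): each content code gathers style from
    # the target frames whose phonetic content matches it.
    return attend(content_codes, target_content, target_style)

def lookup_style(source_content, content_codes, stylebook):
    # Step 2 (hypothetical form): each source frame picks a style from
    # the stylebook according to its own phonetic content.
    return attend(source_content, content_codes, stylebook)

rng = np.random.default_rng(0)
n_codes, d_content, d_style = 8, 16, 4            # assumed sizes
content_codes = rng.standard_normal((n_codes, d_content))   # stand-in for learned codes
target_content = rng.standard_normal((50, d_content))       # stand-in for target SSL features
target_style = rng.standard_normal((50, d_style))           # stand-in for target style features
source_content = rng.standard_normal((30, d_content))       # stand-in for source SSL features

stylebook = build_stylebook(content_codes, target_content, target_style)
frame_styles = lookup_style(source_content, content_codes, stylebook)
print(stylebook.shape, frame_styles.shape)
```

In this sketch the stylebook has a fixed number of entries regardless of how long the target utterance is, so the second step scales with the source length and the number of codes rather than with the target length. Whether this matches the efficiency behavior reported for the actual model is an assumption.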

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing