Disentangling Prosody Representations with Unsupervised Speech Reconstruction

Leyuan Qu; Taihao Li; Cornelius Weber; Theresa Pekarek-Rosin; Fuji Ren; and Stefan Wermter

arXiv:2212.06972·cs.SD·September 27, 2023

Open Access

TL;DR

This paper introduces Prosody2Vec, an unsupervised speech reconstruction model that effectively disentangles prosodic features from speech, enhancing emotion recognition and voice conversion tasks.

Contribution

It proposes a novel unsupervised framework with three key components for disentangling prosody, improving emotion recognition and voice conversion performance.

Findings

01

Prosody2Vec captures general prosodic features transferable across tasks.

02

The model surpasses state-of-the-art methods when combined with HuBERT.

03

The model is effective in both speech emotion recognition and emotional voice conversion.

Abstract

Human speech can be characterized by different components, including semantic content, speaker identity and prosodic information. Significant progress has been made in disentangling representations for semantic content and speaker identity in Automatic Speech Recognition (ASR) and speaker verification tasks, respectively. However, extracting prosodic information remains an open and challenging research question because of the intrinsic association of different attributes, such as timbre and rhythm, and because of the need for supervised training schemes to achieve robust large-scale and speaker-independent ASR. The aim of this paper is to address the disentanglement of emotional prosody from speech based on unsupervised reconstruction. Specifically, we identify, design, implement and integrate three crucial components in our proposed speech reconstruction model Prosody2Vec: (1) a unit…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing