Disentangling Prosody Representations with Unsupervised Speech Reconstruction
Leyuan Qu, Taihao Li, Cornelius Weber, Theresa Pekarek-Rosin, Fuji Ren, and Stefan Wermter

TL;DR
This paper introduces Prosody2Vec, an unsupervised speech reconstruction model that effectively disentangles prosodic features from speech, enhancing emotion recognition and voice conversion tasks.
Contribution
The paper proposes Prosody2Vec, a novel unsupervised speech reconstruction framework built from three key components that disentangles prosody from speech and improves performance on emotion recognition and voice conversion.
Findings
Prosody2Vec captures general prosodic features transferable across tasks.
The model surpasses state-of-the-art methods when combined with HuBERT.
The approach is effective in both speech emotion recognition and emotional voice conversion.
Abstract
Human speech can be characterized by different components, including semantic content, speaker identity and prosodic information. Significant progress has been made in disentangling representations for semantic content and speaker identity in Automatic Speech Recognition (ASR) and speaker verification tasks respectively. However, it is still an open challenging research question to extract prosodic information because of the intrinsic association of different attributes, such as timbre and rhythm, and because of the need for supervised training schemes to achieve robust large-scale and speaker-independent ASR. The aim of this paper is to address the disentanglement of emotional prosody from speech based on unsupervised reconstruction. Specifically, we identify, design, implement and integrate three crucial components in our proposed speech reconstruction model Prosody2Vec: (1) a unit…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
