CAMP: a Two-Stage Approach to Modelling Prosody in Context
Zack Hodari, Alexis Moinet, Sri Karlapati, Jaime Lorenzo-Trueba, Thomas Merritt, Arnaud Joly, Ammar Abbas, Penny Karanasou, Thomas Drugman

TL;DR
This paper introduces CAMP, a two-stage model that improves speech prosody synthesis by disentangling prosodic features and incorporating syntactic and semantic context, significantly narrowing the gap with natural speech.
Contribution
The paper presents a novel two-stage approach that models prosody using word-level representations and context-dependent priors, advancing the state of the art in speech synthesis.
Findings
CAMP outperforms previous methods, closing 26% of the gap with natural speech.
Using a jointly-trained duration model enhances prosody quality.
Disentangling prosodic information improves modeling of slow-varying signals.
Abstract
Prosody is an integral part of communication, but remains an open problem in state-of-the-art speech synthesis. There are two major issues faced when modelling prosody: (1) prosody varies at a slower rate compared with other content in the acoustic signal (e.g. segmental information and background noise); (2) determining appropriate prosody without sufficient context is an ill-posed problem. In this paper, we propose solutions to both these issues. To mitigate the challenge of modelling a slow-varying signal, we learn to disentangle prosodic information using a word level representation. To alleviate the ill-posed nature of prosody modelling, we use syntactic and semantic information derived from text to learn a context-dependent prior over our prosodic space. Our Context-Aware Model of Prosody (CAMP) outperforms the state-of-the-art technique, closing the gap with natural speech by…
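The two-stage idea described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: CAMP's stages are learned models, whereas here simple averaging stands in for the word-level representation and a one-feature least-squares fit stands in for the context-dependent prior. The function names, the F0-like frame values, and the scalar `context_scores` are hypothetical and chosen only for illustration.

```python
from statistics import fmean


def pool_to_words(frame_values, word_spans):
    """Stage 1 stand-in: summarise each word by the mean of its frames.

    Assumption: a real system would learn this representation; averaging
    only illustrates moving from frame rate to the slower word rate.
    """
    return [fmean(frame_values[start:end]) for start, end in word_spans]


def fit_linear_prior(context, targets):
    """Stage 2 stand-in: least-squares line from one context feature
    to the word-level prosody value. Assumes `context` is not constant."""
    mean_x = fmean(context)
    mean_y = fmean(targets)
    var_x = sum((x - mean_x) ** 2 for x in context)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(context, targets))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_prosody(context, slope, intercept):
    """Predict a word-level prosody value from text-derived context alone."""
    return [slope * x + intercept for x in context]


# Hypothetical frame-level values (e.g. F0 in Hz) and word boundaries,
# given as half-open frame index ranges.
frame_f0 = [100.0, 110.0, 120.0, 200.0, 210.0, 150.0, 150.0, 150.0]
word_spans = [(0, 3), (3, 5), (5, 8)]
word_prosody = pool_to_words(frame_f0, word_spans)

# Hypothetical per-word context feature derived from text.
context_scores = [0.0, 1.0, 0.4]
slope, intercept = fit_linear_prior(context_scores, word_prosody)
predicted = predict_prosody(context_scores, slope, intercept)
```

At synthesis time only the second stage's inputs (text-derived context) are available, which is why the sketch predicts the word-level values from `context_scores` rather than from audio.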
