Learning De-identified Representations of Prosody from Raw Audio
Jack Weston, Raphael Lenain, Udeepa Meepegama, Emil Fristed

TL;DR
This paper introduces a contrastive self-supervised method to learn de-identified prosody representations from raw audio, effectively minimizing speaker information while preserving prosodic features for spoken language understanding.
Contribution
The authors develop an approach that exploits the natural structure of prosody to decouple it from speaker identity without access to linguistic information, performing comparably to state-of-the-art speech representations on DAMMP, a new benchmark they introduce.
Findings
The model performs comparably to state-of-the-art speech representations on DAMMP.
Probing shows selective learning of prosody subcomponents.
The representations are less speaker-identifiable than those of existing methods.
Abstract
We propose a method for learning de-identified prosody representations from raw audio using a contrastive self-supervised signal. Whereas prior work has relied on conditioning models on bottlenecks, we introduce a set of inductive biases that exploit the natural structure of prosody to minimize timbral information and decouple prosody from speaker representations. Despite aggressive downsampling of the input and having no access to linguistic information, our model performs comparably to state-of-the-art speech representations on DAMMP, a new benchmark we introduce for spoken language understanding. We use minimum description length probing to show that our representations have selectively learned the subcomponents of non-timbral prosody, and that the product quantizer naturally disentangles them without using bottlenecks. We derive an information-theoretic definition of speech…
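The abstract does not spell out the contrastive objective, so the following is only a minimal sketch of one common choice, an InfoNCE-style loss, and not the authors' implementation. It assumes each context vector has exactly one positive target at the same row index and treats all other rows in the batch as negatives. The function name `info_nce_loss`, the `temperature` parameter, and the use of cosine similarity are illustrative assumptions.

```python
import numpy as np


def info_nce_loss(context, targets, temperature=0.1):
    """InfoNCE-style contrastive loss (illustrative sketch, not the paper's code).

    context: array of shape (N, D), one context vector per example.
    targets: array of shape (N, D); targets[i] is the positive for context[i],
             and every other row serves as a negative.
    """
    # L2-normalise so the dot product is cosine similarity (an assumption).
    c = context / np.linalg.norm(context, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)

    # Pairwise similarity scores, shape (N, N), scaled by the temperature.
    logits = (c @ t.T) / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positive for row i sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ctx = rng.normal(size=(8, 16))
    # Targets close to their contexts should give a lower loss than unrelated targets.
    print(info_nce_loss(ctx, ctx + 0.01 * rng.normal(size=(8, 16))))
    print(info_nce_loss(ctx, rng.normal(size=(8, 16))))
```

In this sketch the loss falls as each context vector becomes more similar to its own target than to the other targets in the batch, which is the general mechanism a contrastive self-supervised signal relies on.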
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
