Learning De-identified Representations of Prosody from Raw Audio

Jack Weston; Raphael Lenain; Udeepa Meepegama; Emil Fristed

arXiv:2107.08248 · cs.CL · July 20, 2021

TL;DR

This paper introduces a contrastive self-supervised method to learn de-identified prosody representations from raw audio, effectively minimizing speaker information while preserving prosodic features for spoken language understanding.

Contribution

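The authors develop a novel approach that exploits the natural structure of prosody to decouple it from speaker identity without access to linguistic information, performing comparably to state-of-the-art speech representations on DAMMP, a new benchmark for spoken language understanding.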

Findings

01

The model performs comparably to state-of-the-art speech representations on DAMMP.

02

Minimum description length probing shows that the representations selectively learn the subcomponents of non-timbral prosody.

03

The representations are less speaker-identifiable than those of existing methods.

Abstract

We propose a method for learning de-identified prosody representations from raw audio using a contrastive self-supervised signal. Whereas prior work has relied on conditioning models on bottlenecks, we introduce a set of inductive biases that exploit the natural structure of prosody to minimize timbral information and decouple prosody from speaker representations. Despite aggressive downsampling of the input and having no access to linguistic information, our model performs comparably to state-of-the-art speech representations on DAMMP, a new benchmark we introduce for spoken language understanding. We use minimum description length probing to show that our representations have selectively learned the subcomponents of non-timbral prosody, and that the product quantizer naturally disentangles them without using bottlenecks. We derive an information-theoretic definition of speech…
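The abstract names a contrastive self-supervised signal but does not specify the objective. As a rough illustration only, the sketch below implements InfoNCE, a commonly used contrastive loss; this choice is an assumption and may differ from the authors' objective. The names `cosine`, `info_nce_loss`, and `temperature` are hypothetical. Each context vector is scored against every target in the batch, with its own target as the positive and the rest as negatives.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(contexts, targets, temperature=0.1):
    """Mean InfoNCE loss over a batch (illustrative, not the paper's code).

    contexts[i] should match targets[i]; every other target in the
    batch serves as a negative for contexts[i].
    """
    losses = []
    for i, c in enumerate(contexts):
        # Similarity of this context to every candidate target.
        logits = [cosine(c, t) / temperature for t in targets]
        # Numerically stable log-sum-exp over the candidates.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability assigned to the true pair.
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)

# Toy batch: aligned pairs should score a much lower loss than swapped pairs.
contexts = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(contexts, [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce_loss(contexts, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is near zero when each context points at its own target and grows large when the targets are swapped, which is the pressure a contrastive objective places on an encoder.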

Videos

Learning de-identified representations of prosody from raw audio · SlidesLive

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing