Prosodic Structure Beyond Lexical Content: A Study of Self-Supervised Learning
Sarenne Wallbridge, Christoph Minixhofer, Catherine Lai, Peter Bell

TL;DR
This paper uses self-supervised learning (SSL) to examine how much predictable structure prosody carries independently of lexical content, and finds that SSL representations encode both local and longer-term prosodic structure.
Contribution
It introduces a Masked Prosody Model that uses SSL to encode prosodic structure at multiple temporal scales; in probing experiments, its representations show strong relative gains over untransformed pitch, energy, and voice activity features.
Findings
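Masked Prosody Model representations can predict perceptual labels that depend on local information, such as word boundaries.
The representations provide the most value for labels involving longer-term structure, such as emotion recognition.
Across the perceptual labels probed, the SSL representations show strong relative gains over untransformed pitch, energy, and voice activity features.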
Abstract
People exploit the predictability of lexical structures during text comprehension. Though predictable structure is also present in speech, the degree to which prosody, e.g. intonation, tempo, and loudness, contributes to such structure independently of the lexical content is unclear. This study leverages self-supervised learning (SSL) to examine the temporal granularity of structures in the acoustic correlates of prosody. Representations from our proposed Masked Prosody Model can predict perceptual labels dependent on local information, such as word boundaries, but provide the most value for labels involving longer-term structures, like emotion recognition. Probing experiments across various perceptual labels show strong relative gains over untransformed pitch, energy, and voice activity features. Our results reveal the importance of SSL training objective timescale and highlight the…
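To make the idea of a masked objective over prosodic features concrete, here is a minimal sketch of span masking with a reconstruction loss scored only on the hidden frames. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the model architecture, the masking scheme, or the loss, so `mask_spans`, `masked_mse`, the span length, the mask ratio, and the random input are all hypothetical.

```python
import numpy as np

def mask_spans(num_frames, span_len, mask_ratio, rng):
    """Return a boolean mask (True = hidden) built from contiguous spans."""
    mask = np.zeros(num_frames, dtype=bool)
    # Number of spans needed to hide roughly mask_ratio of the frames.
    num_spans = max(1, int(round(mask_ratio * num_frames / span_len)))
    starts = rng.integers(0, num_frames - span_len + 1, size=num_spans)
    for s in starts:
        mask[s:s + span_len] = True  # spans may overlap
    return mask

def masked_mse(target, prediction, mask):
    """Mean squared error computed only over the masked frames."""
    diff = (prediction - target)[mask]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)

# Hypothetical input: 200 frames x 3 channels (pitch, energy, voice activity).
features = rng.normal(size=(200, 3))
mask = mask_spans(num_frames=200, span_len=20, mask_ratio=0.3, rng=rng)

# Hide the masked frames from the model by zeroing them out.
corrupted = features.copy()
corrupted[mask] = 0.0

# Stand-in "model" (an assumption): predict the per-channel mean of the
# visible frames everywhere. A real model would be trained to do better.
prediction = np.broadcast_to(features[~mask].mean(axis=0), features.shape)
loss = masked_mse(features, prediction, mask)
print(f"masked frames: {int(mask.sum())}, masked MSE: {loss:.3f}")
```

In this sketch, `span_len` sets how much context is hidden at once: short spans can be filled in from neighbouring frames, while long spans force a model to rely on longer-range structure. That is one plausible reading of what the abstract calls the SSL training objective timescale, not a description of the paper's setup.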
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques
