Representation of perceived prosodic similarity of conversational feedback

Livia Qian; Carol Figueroa; Gabriel Skantze

arXiv:2505.13268 · cs.CL · May 20, 2025

TL;DR

This paper explores how prosodic features of vocal feedback in conversation are perceived and represented, showing that spectral and self-supervised speech models better capture prosody than pitch features, especially within the same speaker, and that contrastive learning improves alignment with human perception.

Contribution

It demonstrates that spectral and self-supervised speech representations effectively encode prosodic similarity and introduces contrastive learning to better align these representations with human perception.

Findings

01. Spectral and self-supervised representations outperform pitch features in encoding prosody.
02. Representations more accurately reflect perceived similarity within the same speaker.
03. Contrastive learning enhances the alignment of speech representations with human perception.

Abstract

Vocal feedback (e.g., 'mhm', 'yeah', 'okay') is an important component of spoken dialogue and is crucial to ensuring common ground in conversational systems. The exact meaning of such feedback is conveyed through both lexical and prosodic form. In this work, we investigate the perceived prosodic similarity of vocal feedback with the same lexical form, and to what extent existing speech representations reflect such similarities. A triadic comparison task with recruited participants is used to measure perceived similarity of feedback responses taken from two different datasets. We find that spectral and self-supervised speech representations encode prosody better than extracted pitch features, especially in the case of feedback from the same speaker. We also find that it is possible to further condense and align the representations to human perception through contrastive learning.
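
One way to picture how a speech representation can be compared against triadic judgements is sketched below. This is only an illustration under stated assumptions, not the authors' evaluation code: it assumes each human judgement is stored as a hypothetical `(anchor, closer, farther)` triplet of clip IDs, that each clip has a fixed-size embedding, and that cosine distance is the similarity measure. The clip names and vectors are toy values invented for the example.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_agreement(embeddings, triplets):
    """Fraction of (anchor, closer, farther) judgements the embedding agrees with.

    A judgement counts as matched when the embedding places `closer`
    nearer to `anchor` than `farther` is.
    """
    hits = 0
    for anchor, closer, farther in triplets:
        d_close = cosine_distance(embeddings[anchor], embeddings[closer])
        d_far = cosine_distance(embeddings[anchor], embeddings[farther])
        if d_close < d_far:
            hits += 1
    return hits / len(triplets)


# Toy embeddings for three hypothetical 'yeah' clips (assumed values).
embeddings = {
    "yeah_a": [1.0, 0.0],
    "yeah_b": [0.9, 0.1],
    "yeah_c": [0.0, 1.0],
}

# Hypothetical listener judgements: (anchor, judged closer, judged farther).
triplets = [
    ("yeah_a", "yeah_b", "yeah_c"),  # the embedding agrees with this one
    ("yeah_c", "yeah_a", "yeah_b"),  # the embedding disagrees with this one
]

agreement = triplet_agreement(embeddings, triplets)
print(f"triplet agreement: {agreement:.2f}")
```

The same triplet format is the usual input to a triplet-style contrastive objective, which would pull `closer` toward `anchor` and push `farther` away; whether the paper uses this exact formulation is not stated in the abstract.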

Taxonomy

Topics: Phonetics and Phonology Research · Emotion and Mood Recognition · Speech Recognition and Synthesis

Methods: ALIGN