Representation of perceived prosodic similarity of conversational feedback
Livia Qian, Carol Figueroa, Gabriel Skantze

TL;DR
This paper studies how listeners perceive the prosodic similarity of vocal feedback in conversation and how well existing speech representations reflect those judgments. Spectral and self-supervised representations capture prosody better than extracted pitch features, especially for feedback from the same speaker, and contrastive learning further aligns them with human perception.
Contribution
It shows that spectral and self-supervised speech representations encode perceived prosodic similarity better than extracted pitch features, and that contrastive learning can further condense these representations and align them with human perception.
Findings
Spectral and self-supervised representations outperform pitch features in encoding prosody.
This advantage is strongest for feedback responses from the same speaker.
Contrastive learning enhances the alignment of speech representations with human perception.
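One way to quantify how well a representation reflects perceived similarity is to count how often its distances agree with listener judgments. The sketch below is illustrative only and is not taken from the paper: it assumes each triadic judgment has been reduced to a hypothetical `(anchor, similar, dissimilar)` tuple, uses cosine distance as the metric, and runs on toy two-dimensional embeddings.

```python
import math


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_agreement(embeddings, triplets):
    """Fraction of judgments where the anchor is closer to the item
    listeners called similar than to the item they called dissimilar."""
    agree = 0
    for anchor, similar, dissimilar in triplets:
        d_sim = cosine_distance(embeddings[anchor], embeddings[similar])
        d_dis = cosine_distance(embeddings[anchor], embeddings[dissimilar])
        if d_sim < d_dis:
            agree += 1
    return agree / len(triplets)


# Toy embeddings and judgments (hypothetical, for illustration only).
embeddings = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
}
triplets = [("a", "b", "c"), ("c", "a", "b")]
score = triplet_agreement(embeddings, triplets)
```

With these toy values the embedding agrees with the first judgment but not the second, so `score` is 0.5. Comparing this score across representations (for example, pitch features versus self-supervised embeddings) gives a simple ranking under the stated assumptions.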
Abstract
Vocal feedback (e.g., 'mhm', 'yeah', 'okay') is an important component of spoken dialogue and is crucial to ensuring common ground in conversational systems. The exact meaning of such feedback is conveyed through both lexical and prosodic form. In this work, we investigate the perceived prosodic similarity of vocal feedback with the same lexical form, and to what extent existing speech representations reflect such similarities. A triadic comparison task with recruited participants is used to measure perceived similarity of feedback responses taken from two different datasets. We find that spectral and self-supervised speech representations encode prosody better than extracted pitch features, especially in the case of feedback from the same speaker. We also find that it is possible to further condense and align the representations to human perception through contrastive learning.
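The abstract does not specify which contrastive objective is used. As a minimal sketch, a triplet margin loss is one common choice for learning from triadic judgments: it penalizes an embedding whenever the anchor is not closer to the item judged similar than to the item judged dissimilar by at least a margin. The function names, the Euclidean metric, and the margin value below are assumptions made for illustration, not the authors' setup.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Zero when the anchor is at least `margin` closer to the positive
    than to the negative; otherwise grows with the size of the violation."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Violated triplet: the "similar" item is far away, the "dissimilar" one is near.
loss_violated = triplet_margin_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])

# Satisfied triplet: the "similar" item is near, the "dissimilar" one is far.
loss_satisfied = triplet_margin_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
```

In a training setup, a small projection on top of a pretrained speech representation would be optimized to reduce this loss over the collected judgments, which would yield a lower-dimensional space in which distances follow listener perception more closely.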
Taxonomy
Topics: Phonetics and Phonology Research · Emotion and Mood Recognition · Speech Recognition and Synthesis
Methods: ALIGN
