Cluster-to-Predict Affect Contours from Speech

Gökhan Kuşçu; Engin Erzin

arXiv:2406.02569·eess.AS·June 6, 2024

TL;DR

This paper introduces a cluster-to-predict (C2P) framework for continuous emotion recognition from speech. An unsupervised iterative optimization learns affect-contour clusters so that they can be predicted from speech with higher precision.

Contribution

The paper proposes the C2P approach, which learns affect-contour clusters and predicts them from speech, recasting continuous emotion tracking as a cluster prediction problem.

Findings

01

Achieved F1 scores of 0.84 for arousal and 0.75 for valence.

02

Demonstrated the effectiveness of speech-driven affect-contour clustering.

03

Validated on the RECOLA dataset with promising results.

Abstract

Continuous emotion recognition (CER) aims to track the dynamic changes in a person's emotional state over time. This paper proposes a novel approach to translating CER into a prediction problem of dynamic affect-contour clusters from speech, where the affect-contour is defined as the contour of annotated affect attributes in a temporal window. Our approach defines a cluster-to-predict (C2P) framework that learns affect-contour clusters, which are predicted from speech with higher precision. To achieve this, C2P runs an unsupervised iterative optimization process to learn affect-contour clusters by minimizing both clustering loss and speech-driven affect-contour prediction loss. Our objective findings demonstrate the value of speech-driven clustering for both arousal and valence attributes. Experiments conducted on the RECOLA dataset yielded promising classification results, with F1…
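The alternating idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `extract_contours` and `c2p_sketch`, the weight `lam`, and the nearest-mean stand-in for the speech-driven predictor are all assumptions made for illustration. The sketch windows an annotation trace into affect contours, then alternates between refitting cluster parameters and reassigning each window to the cluster that minimizes a combined clustering and prediction cost.

```python
import numpy as np

def extract_contours(trace, win, hop):
    """Slice a 1-D annotation trace into fixed-length windows (affect contours)."""
    starts = range(0, len(trace) - win + 1, hop)
    return np.stack([trace[s:s + win] for s in starts])

def c2p_sketch(contours, speech, k, lam=1.0, iters=20, seed=0):
    """Hypothetical joint objective, minimized by alternating updates:
    squared distance of each contour to its cluster centroid, plus
    lam * squared distance of the paired speech feature to the cluster's
    mean speech feature (a simple stand-in for a prediction loss).
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(contours))
    for _ in range(iters):
        # Refit step: contour centroids and per-cluster speech means.
        cent = np.zeros((k, contours.shape[1]))
        mean = np.zeros((k, speech.shape[1]))
        for j in range(k):
            mask = labels == j
            if mask.any():
                cent[j] = contours[mask].mean(axis=0)
                mean[j] = speech[mask].mean(axis=0)
            else:
                # Re-seed an empty cluster from a random sample.
                idx = rng.integers(0, len(contours))
                cent[j], mean[j] = contours[idx], speech[idx]
        # Assignment step: combined clustering + prediction cost.
        d_clu = ((contours[:, None, :] - cent[None]) ** 2).sum(axis=-1)
        d_pred = ((speech[:, None, :] - mean[None]) ** 2).sum(axis=-1)
        new_labels = (d_clu + lam * d_pred).argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
    return labels, cent, mean
```

At inference time only speech is available, so under the same assumptions a window would be assigned to the cluster whose speech mean is nearest. Setting `lam=0` reduces the sketch to plain k-means on the contours, which makes the contribution of the speech-driven term easy to isolate.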

Taxonomy

Topics: Speech and Audio Processing · Emotion and Mood Recognition · Speech Recognition and Synthesis