Is Style All You Need? Dependencies Between Emotion and GST-based Speaker Recognition
Morgan Sandler, Arun Ross

TL;DR
This paper explores using speaker recognition embeddings derived from GST-based models to detect and classify emotions in speech, demonstrating competitive accuracy across multiple datasets.
Contribution
It reuses the weights of a trained speaker recognition model to generate features for emotion classification, and proposes a hierarchical classifier to improve accuracy.
Findings
Achieved up to 81.2% accuracy on IEMOCAP
Proposed hierarchical classifier improves accuracy by 2% on CREMA-D
Speaker embeddings effectively encode emotional information
Abstract
In this work, we study the hypothesis that speaker identity embeddings extracted from speech samples may be used for detection and classification of emotion. In particular, we show that emotions can be effectively identified by learning speaker identities using a 1-D Triplet Convolutional Neural Network (CNN) and Global Style Token (GST) scheme (e.g., DeepTalk Network) and reusing the trained speaker recognition model weights to generate features in the emotion classification domain. The automatic speaker recognition (ASR) network is trained on the VoxCeleb1, VoxCeleb2, and Librispeech datasets with a triplet training loss function using speaker identity labels. Using a Support Vector Machine (SVM) classifier, we map speaker identity embeddings into discrete emotion categories from the CREMA-D, IEMOCAP, and MSP-Podcast datasets. On the task of speech emotion detection, we obtain 80.8%…
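The pipeline the abstract describes (freeze a speaker recognition model, extract embeddings, fit an SVM on emotion labels) can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation. The embeddings are synthetic stand-ins generated by a hypothetical `make_fake_embeddings` helper rather than outputs of a DeepTalk-style network. The three-emotion label set is illustrative. The classifier is a one-vs-rest linear SVM trained from scratch with stochastic gradient descent on the hinge loss so that the example needs only the standard library; the excerpt does not state which SVM kernel or settings the paper uses.

```python
import random

# Illustrative label set; the datasets in the abstract use their own categories.
EMOTIONS = ["angry", "happy", "neutral"]


def make_fake_embeddings(n_per_class, dim, rng):
    """Synthetic stand-in for embeddings from a frozen speaker-recognition model."""
    data = []
    for label in range(len(EMOTIONS)):
        for _ in range(n_per_class):
            vec = [rng.gauss(0.0, 0.5) for _ in range(dim)]
            vec[label] += 3.0  # one noisy cluster per emotion
            data.append((vec, label))
    return data


def score(weights, vec):
    """Linear decision value w.x + b (the bias is stored as the last weight)."""
    return sum(w * x for w, x in zip(weights, vec)) + weights[-1]


def train_binary_svm(data, target, dim, rng, epochs=30, lr=0.01, lam=0.01):
    """Linear SVM for 'target vs. rest': SGD on the L2-regularised hinge loss."""
    weights = [0.0] * (dim + 1)
    samples = list(data)  # copy so the caller's ordering is untouched
    for _ in range(epochs):
        rng.shuffle(samples)
        for vec, label in samples:
            y = 1.0 if label == target else -1.0
            violated = y * score(weights, vec) < 1.0
            # Weight decay from the L2 penalty (the bias is not regularised).
            for i in range(dim):
                weights[i] -= lr * lam * weights[i]
            if violated:
                # Hinge-loss sub-gradient step for a margin violation.
                for i in range(dim):
                    weights[i] += lr * y * vec[i]
                weights[-1] += lr * y
    return weights


def fit_one_vs_rest(data, dim, rng):
    """Train one binary SVM per emotion on the fixed embeddings."""
    return [train_binary_svm(data, k, dim, rng) for k in range(len(EMOTIONS))]


def predict(models, vec):
    """Return the index of the emotion whose SVM gives the highest score."""
    scores = [score(w, vec) for w in models]
    return scores.index(max(scores))


def accuracy(models, data):
    hits = sum(1 for vec, label in data if predict(models, vec) == label)
    return hits / len(data)


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 8
    train = make_fake_embeddings(80, dim, rng)
    held_out = make_fake_embeddings(20, dim, rng)
    models = fit_one_vs_rest(train, dim, rng)
    print(f"held-out accuracy: {accuracy(models, held_out):.3f}")
```

The point of the sketch is the division of labour: the embedding extractor is never updated with emotion labels, and only the lightweight classifier on top is trained. With real data, `make_fake_embeddings` would be replaced by a pass through the frozen speaker model.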
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
