Is Style All You Need? Dependencies Between Emotion and GST-based Speaker Recognition

Morgan Sandler; Arun Ross

arXiv:2211.08213 · eess.AS · November 16, 2022

TL;DR

This paper explores using speaker recognition embeddings derived from GST-based models to detect and classify emotions in speech, demonstrating competitive accuracy across multiple datasets.

Contribution

It introduces a novel approach of reusing speaker recognition model weights for emotion classification and proposes a hierarchical classifier to improve accuracy.

Findings

01. Achieved up to 81.2% accuracy on IEMOCAP

02. Proposed hierarchical classifier improves accuracy by 2% on CREMA-D

03. Speaker embeddings effectively encode emotional information

Abstract

In this work, we study the hypothesis that speaker identity embeddings extracted from speech samples may be used for detection and classification of emotion. In particular, we show that emotions can be effectively identified by learning speaker identities with a 1-D Triplet Convolutional Neural Network (CNN) and Global Style Token (GST) scheme (e.g., the DeepTalk network) and reusing the trained speaker recognition model weights to generate features in the emotion classification domain. The automatic speaker recognition (ASR) network is trained on the VoxCeleb1, VoxCeleb2, and Librispeech datasets with a triplet training loss function using speaker identity labels. Using a Support Vector Machine (SVM) classifier, we map speaker identity embeddings into discrete emotion categories from the CREMA-D, IEMOCAP, and MSP-Podcast datasets. On the task of speech emotion detection, we obtain 80.8%…
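The abstract mentions a triplet training loss over speaker identity labels but does not give its exact form. As a rough illustration only, the sketch below uses the common margin-based hinge formulation with squared Euclidean distance. The distance measure, the `margin` value, the function names, and the toy 2-D "embeddings" are all assumptions made for this example and are not taken from the paper or its repository.

```python
# Illustrative sketch of a margin-based triplet loss (assumed formulation,
# not necessarily the one used by the authors).

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that is zero once the anchor is closer to the positive
    (same speaker) than to the negative (different speaker) by `margin`."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


# Toy 2-D embeddings (hypothetical values, for illustration only).
anchor = [0.0, 1.0]      # utterance from speaker A
positive = [0.1, 0.9]    # another utterance from speaker A
negative = [1.0, 0.0]    # utterance from speaker B

# Well-separated triplet: the margin is already satisfied, so the loss is zero.
easy = triplet_loss(anchor, positive, negative)

# Swapping positive and negative violates the margin and yields a positive loss.
hard = triplet_loss(anchor, negative, positive)

print(easy, hard)
```

Training with a loss of this kind pulls embeddings of the same speaker together and pushes different speakers apart; the abstract's claim is that the resulting embedding space also carries enough emotional information for a downstream SVM to separate emotion categories.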

Code & Models

Repositories

morganlee123/deeptalkemotions (tf, Official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing