SELM: Enhancing Speech Emotion Recognition for Out-of-Domain Scenarios

Hazim Bukhari; Soham Deshmukh; Hira Dhamyal; Bhiksha Raj; Rita Singh

arXiv:2407.15300 · cs.SD · July 23, 2024

TL;DR

This paper introduces SELM, a novel speech emotion recognition model that formulates the task as sequence generation, significantly improving out-of-domain accuracy and enabling few-shot learning.

Contribution

The paper proposes a new formulation of SER inspired by ASR, using a sequence generation approach with an audio-conditioned language model, SELM.

Findings

01. SELM outperforms state-of-the-art baselines on OOD datasets.

02. SELM achieves 17% and 7% relative accuracy improvements on RAVDESS and CREMA-D, respectively.

03. Few-shot learning further enhances SELM's performance.
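
The improvements above are reported as relative, not absolute, gains. A minimal sketch of that arithmetic follows; the function name and the accuracy values are hypothetical and chosen only for illustration, they are not results from the paper.

```python
def relative_improvement(baseline_acc: float, new_acc: float) -> float:
    """Return the gain of new_acc over baseline_acc as a fraction of the baseline."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (new_acc - baseline_acc) / baseline_acc


# Hypothetical accuracies (NOT taken from the paper): a baseline at 50.0%
# and a new model at 58.5% give an absolute gain of 8.5 points,
# which is a 17% relative gain over the baseline.
baseline = 0.50
improved = 0.585
gain = relative_improvement(baseline, improved)
print(f"relative improvement: {gain:.0%}")
```

A relative figure depends on the baseline it is measured against, so the same percentage can correspond to different absolute gains on different datasets.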

Abstract

Speech Emotion Recognition (SER) has traditionally been formulated as a classification task. However, emotions are generally a spectrum whose distribution varies from situation to situation, leading to poor Out-of-Domain (OOD) performance. We take inspiration from the statistical formulation of Automatic Speech Recognition (ASR) and formulate the SER task as generating the most likely sequence of text tokens to infer emotion. The formulation breaks SER into predicting acoustic model features weighted by language model prediction. As an instance of this approach, we present SELM, an audio-conditioned language model for SER that predicts different emotion views. We train SELM on a curated speech emotion corpus and test it on three OOD datasets (RAVDESS, CREMA-D, IEMOCAP) not used in training. SELM achieves significant improvements over the state-of-the-art baselines, with 17% and 7% relative…
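
For orientation, the statistical formulation of ASR that the abstract refers to is usually written as a Bayes decomposition into an acoustic model and a language model. The sketch below restates that standard form and then an analogous form for SER. The symbols $A$, $W$, and $T$ are assumptions introduced here, and the second equation is an illustrative reading of the abstract, not the paper's own notation.

```latex
% Standard statistical ASR: pick the word sequence W that is most likely
% given the audio A. P(A | W) is the acoustic model, P(W) the language model.
\hat{W} = \arg\max_{W} P(W \mid A) = \arg\max_{W} P(A \mid W)\, P(W)

% Analogous reading for SER (assumed notation): pick the sequence of text
% tokens T describing the emotion that is most likely given the audio A.
\hat{T} = \arg\max_{T} P(T \mid A) = \arg\max_{T} P(A \mid T)\, P(T)
```

The second equality in each line follows from Bayes' rule, since $P(A)$ does not depend on the sequence being maximized over.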

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition