SER Evals: In-domain and Out-of-domain Benchmarking for Speech Emotion Recognition

Mohamed Osman; Daniel Z. Kaplan; Tamer Nadeem

arXiv:2408.07851 · cs.CL · August 16, 2024

TL;DR

This paper introduces a comprehensive benchmark for speech emotion recognition (SER) to evaluate model robustness across languages and domains, revealing that generalization remains a key challenge and that some ASR models outperform specialized SER models.

Contribution

It presents a large-scale, multilingual benchmark for SER, incorporating diverse datasets and evaluation methods to assess model generalization in in-domain and out-of-domain scenarios.

Findings

1. Whisper outperforms SSL models in cross-lingual SER
2. Benchmark reveals generalization challenges in current SER models
3. Logit adjustment improves evaluation consistency

Abstract

Speech emotion recognition (SER) has made significant strides with the advent of powerful self-supervised learning (SSL) models. However, the generalization of these models to diverse languages and emotional expressions remains a challenge. We propose a large-scale benchmark to evaluate the robustness and adaptability of state-of-the-art SER models in both in-domain and out-of-domain settings. Our benchmark includes a diverse set of multilingual datasets, focusing on less commonly used corpora to assess generalization to new data. We employ logit adjustment to account for varying class distributions and establish a single dataset cluster for systematic evaluation. Surprisingly, we find that the Whisper model, primarily designed for automatic speech recognition, outperforms dedicated SSL models in cross-lingual SER. Our results highlight the need for more robust and generalizable SER…
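The abstract mentions logit adjustment as a way to account for varying class distributions but does not spell out the formulation in this excerpt. The following is a minimal sketch of one common post-hoc variant, in which each class logit is shifted by the log of its class prior; it is an illustration under that assumption, not necessarily the authors' method. The function names, the `tau` scaling parameter, and the class counts are all hypothetical.

```python
import math


def logit_adjust(logits, class_counts, tau=1.0):
    """Shift each class logit by -tau * log(prior of that class).

    Assumption: priors are estimated from label counts in the training
    data. This is a generic post-hoc logit adjustment, not a formulation
    confirmed by the paper excerpt.
    """
    total = sum(class_counts)
    log_priors = [math.log(count / total) for count in class_counts]
    return [z - tau * lp for z, lp in zip(logits, log_priors)]


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical 3-class emotion problem with a heavily skewed training set.
train_counts = [800, 150, 50]

# Hypothetical raw logits for a single utterance.
raw_logits = [2.0, 1.5, 0.2]
adjusted_logits = logit_adjust(raw_logits, train_counts)

raw_prediction = argmax(raw_logits)            # favours the majority class
adjusted_prediction = argmax(adjusted_logits)  # frequency bias is reduced
```

With `tau=0.0` the logits are returned unchanged, so the parameter can be read as controlling how strongly the class-frequency bias is removed before predictions are compared across datasets with different label distributions.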

Code & Models

Repositories

spaghettiSystems/serval (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition

Methods: Logit adjustment · Self-supervised learning (SSL) models · Whisper