SER Evals: In-domain and Out-of-domain Benchmarking for Speech Emotion Recognition
Mohamed Osman, Daniel Z. Kaplan, Tamer Nadeem

TL;DR
This paper introduces a comprehensive benchmark for speech emotion recognition (SER) that evaluates model robustness across languages and domains. It shows that generalization remains a key challenge and that Whisper, a model designed primarily for automatic speech recognition (ASR), outperforms dedicated self-supervised learning (SSL) models in cross-lingual SER.
Contribution
It presents a large-scale, multilingual benchmark for SER, incorporating diverse datasets and evaluation methods to assess model generalization in in-domain and out-of-domain scenarios.
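As a rough illustration of the in-domain versus out-of-domain distinction, the sketch below enumerates train/test corpus pairs and labels each one. It assumes a simple cross-corpus grid in which a model trained on one corpus is tested on every corpus; the function name and the corpus names are hypothetical, and the benchmark's actual protocol may differ.

```python
def evaluation_pairs(corpus_names):
    """Return (train_corpus, test_corpus, setting) triples for every ordered pair.

    Assumption: a pair counts as in-domain when the train and test corpus are
    the same (different splits of one dataset), and out-of-domain otherwise.
    """
    pairs = []
    for train in corpus_names:
        for test in corpus_names:
            setting = "in-domain" if train == test else "out-of-domain"
            pairs.append((train, test, setting))
    return pairs


# Hypothetical corpus names, used only for illustration.
corpora = ["corpus_a", "corpus_b", "corpus_c"]
pairs = evaluation_pairs(corpora)
out_of_domain = [p for p in pairs if p[2] == "out-of-domain"]
```

With N corpora this yields N in-domain pairs and N × (N − 1) out-of-domain pairs, so the out-of-domain results dominate the grid as more datasets are added.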
Findings
Whisper outperforms SSL models in cross-lingual SER
Benchmark reveals generalization challenges in current SER models
Logit adjustment, used to account for varying class distributions, improves evaluation consistency
Abstract
Speech emotion recognition (SER) has made significant strides with the advent of powerful self-supervised learning (SSL) models. However, the generalization of these models to diverse languages and emotional expressions remains a challenge. We propose a large-scale benchmark to evaluate the robustness and adaptability of state-of-the-art SER models in both in-domain and out-of-domain settings. Our benchmark includes a diverse set of multilingual datasets, focusing on less commonly used corpora to assess generalization to new data. We employ logit adjustment to account for varying class distributions and establish a single dataset cluster for systematic evaluation. Surprisingly, we find that the Whisper model, primarily designed for automatic speech recognition, outperforms dedicated SSL models in cross-lingual SER. Our results highlight the need for more robust and generalizable SER…
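The abstract mentions logit adjustment as the way class-distribution differences are handled. A minimal sketch of the standard post-hoc form is given here, assuming the adjustment subtracts a scaled log class prior from each logit before taking the argmax. The helper names, the add-one smoothing, the `tau` parameter, and the toy data are illustrative assumptions, not details taken from the paper.

```python
import math


def class_priors(labels, num_classes):
    """Empirical class frequencies, with add-one smoothing to avoid log(0)."""
    counts = [1] * num_classes
    for y in labels:
        counts[y] += 1
    total = sum(counts)
    return [c / total for c in counts]


def adjust_logits(logits, priors, tau=1.0):
    """Post-hoc logit adjustment: subtract tau * log(prior) from each class logit."""
    return [z - tau * math.log(p) for z, p in zip(logits, priors)]


def predict(logits, priors, tau=1.0):
    """Index of the largest adjusted logit."""
    adjusted = adjust_logits(logits, priors, tau)
    return max(range(len(adjusted)), key=adjusted.__getitem__)


# Hypothetical example: 3 emotion classes, training labels skewed toward class 0.
train_labels = [0] * 80 + [1] * 15 + [2] * 5
priors = class_priors(train_labels, num_classes=3)
logits = [2.0, 1.5, 1.0]

raw_pred = max(range(len(logits)), key=logits.__getitem__)  # ignores the priors
adj_pred = predict(logits, priors)                          # corrects for class skew
```

Because rare classes have small priors, subtracting `tau * log(prior)` raises their scores relative to frequent classes. In this toy case the raw prediction is the majority class, while the adjusted prediction moves to the rarest class; setting `tau=0` recovers the unadjusted behavior.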
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition
Methods: Sparse Evolutionary Training
