VoxEmo: Benchmarking Speech Emotion Recognition with Speech LLMs
Hezhao Zhang, Huang-Cheng Chou, Shrikanth Narayanan, Thomas Hain

TL;DR
VoxEmo introduces a comprehensive benchmark for speech emotion recognition using speech LLMs, addressing evaluation challenges and human emotion ambiguity across diverse languages and prompt types.
Contribution
It provides a standardized toolkit with diverse prompts, a distribution-aware soft-label protocol, and a prompt-ensemble strategy to better evaluate speech LLMs in emotion recognition.
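To make the soft-label and prompt-ensemble ideas concrete, here is a minimal sketch, not the paper's actual protocol. It assumes that each prompt variant yields one categorical emotion label, that the ensemble distribution is the normalized vote count, and that agreement with human annotators is measured with Jensen-Shannon divergence. The label set, the function names, the example votes, and the choice of divergence are all illustrative assumptions.

```python
from collections import Counter
import math

# Hypothetical label set; real corpora define their own emotion classes.
EMOTIONS = ["angry", "happy", "neutral", "sad"]

def to_distribution(labels, classes=EMOTIONS):
    """Turn a list of categorical votes into a probability distribution."""
    counts = Counter(labels)
    total = sum(counts[c] for c in classes)
    if total == 0:
        raise ValueError("no votes fall inside the label set")
    return [counts[c] / total for c in classes]

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint support)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Assumed example: five human annotators and four prompt variants on one utterance.
human_votes = ["sad", "sad", "neutral", "sad", "angry"]
prompt_outputs = ["sad", "neutral", "sad", "sad"]

human = to_distribution(human_votes)      # soft label from annotator votes
model = to_distribution(prompt_outputs)   # soft label from the prompt ensemble
score = js_divergence(human, model)       # lower means closer to human perception
```

Under these assumptions, a model can score well on distributional agreement even when its single most frequent label differs from the annotators' majority vote, which is the distinction between soft-label and hard-label evaluation.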
Findings
Zero-shot speech LLMs align with human emotion distributions
Zero-shot speech LLMs lag behind supervised models in hard-label accuracy
Benchmark covers 35 corpora in 15 languages
Abstract
Speech Large Language Models (LLMs) show great promise for speech emotion recognition (SER) via generative interfaces. However, shifting from closed-set classification to open text generation introduces zero-shot stochasticity, making evaluation highly sensitive to prompts. Additionally, conventional speech LLM benchmarks overlook the inherent ambiguity of human emotion. Hence, we present VoxEmo, a comprehensive SER benchmark encompassing 35 emotion corpora across 15 languages for Speech LLMs. VoxEmo provides a standardized toolkit featuring varying prompt complexities, from direct classification to paralinguistic reasoning. To reflect real-world perception/application, we introduce a distribution-aware soft-label protocol and a prompt-ensemble strategy that emulates annotator disagreement. Experiments reveal that while zero-shot speech LLMs trail supervised baselines in hard-label…
Taxonomy
Topics: Emotion and Mood Recognition · Sentiment Analysis and Opinion Mining · Mental Health via Writing
