VoxEmo: Benchmarking Speech Emotion Recognition with Speech LLMs

Hezhao Zhang; Huang-Cheng Chou; Shrikanth Narayanan; Thomas Hain

arXiv:2603.08936 · cs.SD · March 11, 2026

TL;DR

VoxEmo is a benchmark for speech emotion recognition with speech LLMs. It targets two evaluation problems: sensitivity of zero-shot outputs to the prompt, and the inherent ambiguity of human emotion. It spans 35 corpora in 15 languages and several prompt types.

Contribution

VoxEmo provides a standardized toolkit with prompts of varying complexity, a distribution-aware soft-label protocol, and a prompt-ensemble strategy that emulates annotator disagreement, so that speech LLMs can be evaluated on emotion recognition more faithfully.
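
To make the soft-label and prompt-ensemble ideas concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the VoxEmo toolkit: the label set, the example votes, the helper names (`label_distribution`, `js_divergence`), and the choice of Jensen-Shannon divergence as the comparison measure are all hypothetical and do not come from the paper. The sketch treats the answers a model gives under several prompts as if they were votes from several annotators, turns both sets of votes into distributions, and compares them.

```python
from collections import Counter
import math

# Hypothetical label set; the real benchmark's label inventory may differ.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def label_distribution(labels, classes=EMOTIONS):
    """Turn a list of hard labels into a normalized distribution over classes."""
    counts = Counter(label for label in labels if label in classes)
    total = sum(counts.values())
    if total == 0:
        # No usable votes: fall back to a uniform distribution.
        return [1.0 / len(classes)] * len(classes)
    return [counts[c] / total for c in classes]


def js_divergence(p, q):
    """Jensen-Shannon divergence with log base 2, so the value lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical data for one utterance: votes from five human annotators.
human_votes = ["sad", "sad", "neutral", "sad", "neutral"]
# Hypothetical parsed answers from one model under five different prompts.
prompt_outputs = ["sad", "neutral", "sad", "sad", "angry"]

human_dist = label_distribution(human_votes)
model_dist = label_distribution(prompt_outputs)

# Soft-label view: how far is the prompt-ensemble distribution from the human one?
score = js_divergence(human_dist, model_dist)

# Hard-label view: majority vote over the prompt ensemble.
hard_label = EMOTIONS[max(range(len(EMOTIONS)), key=lambda i: model_dist[i])]

print("human:", human_dist)
print("model:", model_dist)
print("JS divergence:", round(score, 4))
print("majority label:", hard_label)
```

A lower divergence means the spread of model answers across prompts resembles the spread of human judgments more closely, which is a different question from whether the majority label matches a single reference label.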

Findings

01. Zero-shot speech LLMs align with human emotion distributions

02. Zero-shot speech LLMs lag behind supervised models in hard-label accuracy

03. The benchmark covers 35 corpora in 15 languages

Abstract

Speech Large Language Models (LLMs) show great promise for speech emotion recognition (SER) via generative interfaces. However, shifting from closed-set classification to open text generation introduces zero-shot stochasticity, making evaluation highly sensitive to prompts. Additionally, conventional speech LLM benchmarks overlook the inherent ambiguity of human emotion. Hence, we present VoxEmo, a comprehensive SER benchmark for speech LLMs encompassing 35 emotion corpora across 15 languages. VoxEmo provides a standardized toolkit featuring prompts of varying complexity, from direct classification to paralinguistic reasoning. To reflect real-world perception and application, we introduce a distribution-aware soft-label protocol and a prompt-ensemble strategy that emulates annotator disagreement. Experiments reveal that while zero-shot speech LLMs trail supervised baselines in hard-label…

Taxonomy

Topics: Emotion and Mood Recognition · Sentiment Analysis and Opinion Mining · Mental Health via Writing