HPSU: A Benchmark for Human-Level Perception in Real-World Spoken Speech Understanding
Chen Li, Peiji Yang, Yicheng Zhong, Jianxing Yu, Zhisheng Wang, Zihao Gou, Wenqing Chen, Jian Yin

TL;DR
HPSU is a comprehensive benchmark of over 20,000 samples for evaluating how well speech models understand latent intentions and implicit emotions; it reveals that current models still lag behind human perception.
Contribution
The paper presents HPSU, a new benchmark for assessing human-level perception in speech understanding, including a semi-automatic annotation process for complex, real-world data.
Findings
Models underperform compared to humans in understanding implicit emotions.
HPSU covers diverse tasks from speaker attributes to complex inferences.
Semi-automatic annotation improves data quality and efficiency.
Abstract
Recent advances in Speech Large Language Models (Speech LLMs) have led to great progress in speech understanding tasks such as Automatic Speech Recognition (ASR) and Speech Emotion Recognition (SER). However, whether these models can achieve human-level auditory perception, particularly in terms of their ability to comprehend latent intentions and implicit emotions in real-world spoken language, remains underexplored. To this end, we introduce the Human-level Perception in Spoken Speech Understanding (HPSU), a new benchmark for fully evaluating the human-level perceptual and understanding capabilities of Speech LLMs. HPSU comprises over 20,000 expert-validated spoken language understanding samples in English and Chinese. It establishes a comprehensive evaluation framework by encompassing a spectrum of tasks, ranging from basic speaker attribute recognition to complex inference of latent…
Taxonomy
Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Multimodal Machine Learning Applications
