Learning Physiology-Informed Vocal Spectrotemporal Representations for Speech Emotion Recognition

Xu Zhang; Longbing Cao; Runze Yang; Zhangkai Wu

arXiv:2602.13259·cs.SD·February 17, 2026

Open Access

TL;DR

PhysioSER introduces a physiology-informed vocal spectrotemporal representation method that enhances speech emotion recognition by modeling amplitude and phase dynamics based on voice anatomy, leading to interpretable and efficient emotion detection.

Contribution

The paper presents PhysioSER, a novel framework that integrates physiological voice features with deep learning for improved, interpretable speech emotion recognition.

Findings

01. Effective across 14 datasets and 10 languages
02. Validated in real-time humanoid robot deployment
03. Outperforms existing models in interpretability and efficiency

Abstract

Speech emotion recognition (SER) is essential for humanoid robot tasks such as social robotic interactions and robotic psychological diagnosis, where interpretable and efficient models are critical for safety and performance. Existing deep models trained on large datasets remain largely uninterpretable: they often model the underlying emotional acoustic signals insufficiently and fail to capture and analyze the core physiology of emotional vocal behaviors. Physiological research on human voices shows that the dynamics of vocal amplitude and phase correlate with emotions through the vocal tract filter and the glottal source. However, most existing deep models use amplitude alone and do not couple the physiological features within and between amplitude and phase. Here, we propose PhysioSER, a physiology-informed vocal spectrotemporal representation learning method, to address these issues…
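To make the amplitude/phase distinction in the abstract concrete, the following is a minimal sketch of how the two quantities can be separated from a short-time Fourier transform. It is not the PhysioSER method, which this listing does not describe in detail. The function name `stft_amplitude_phase`, the 25 ms frame and 10 ms hop (at an assumed 16 kHz sampling rate), and the synthetic 440 Hz test tone are all illustrative assumptions.

```python
import numpy as np


def stft_amplitude_phase(signal, frame_len=400, hop=160):
    """Split a 1-D signal into STFT amplitude and phase.

    Hypothetical helper for illustration only; frame_len and hop are
    assumed values (25 ms / 10 ms at 16 kHz), not taken from the paper.
    Returns two arrays of shape (n_frames, frame_len // 2 + 1).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice overlapping frames and apply the analysis window to each.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    spectrum = np.fft.rfft(frames, axis=1)
    # Amplitude is the magnitude of each complex bin; phase is its angle.
    return np.abs(spectrum), np.angle(spectrum)


# Synthetic stand-in for a speech signal: one second of a 440 Hz tone.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)

amplitude, phase = stft_amplitude_phase(tone)

# A simple view of phase dynamics: the frame-to-frame change of the
# unwrapped phase in each frequency bin.
phase_delta = np.diff(np.unwrap(phase, axis=0), axis=0)
```

Amplitude-only front ends (for example, magnitude or mel spectrograms) keep `amplitude` and discard `phase`; a model that uses both would also consume `phase` or a derived quantity such as `phase_delta`.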

Taxonomy

Topics: Emotion and Mood Recognition · Voice and Speech Disorders · Speech Recognition and Synthesis