How Speech is Recognized to Be Emotional - A Study Based on Information Decomposition
Haoran Sun, Lantian Li, Thomas Fang Zheng, Dong Wang

TL;DR
This study decomposes speech into content, pitch, and rhythm factors and measures how each contributes to emotion recognition. Rhythm proves the most important factor, and cross-corpus generalization remains a clear weakness of current models.
Contribution
The paper applies a decomposition-based analysis, built on the SpeechFlow model, to separate speech signals into content, pitch, and rhythm, and assesses how each factor and their combinations affect emotion recognition performance.
Findings
Rhythm is the most crucial component for emotional expression.
Cross-corpus emotion recognition performance is poor, in some cases even worse than random guessing.
Removing unimportant components can improve cross-corpus results.
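To make the "worse than random guessing" comparison concrete, here is a minimal sketch of how a cross-corpus accuracy could be set against two chance baselines. It is not the authors' evaluation code: the function names, the emotion labels, and the toy predictions are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of utterances whose predicted label matches the reference."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def chance_levels(y_true):
    """Return (uniform, majority) baselines for a test set.

    uniform:  1 / number of classes, i.e. guessing uniformly at random.
    majority: share of the most frequent class, i.e. always predicting it.
    """
    counts = Counter(y_true)
    uniform = 1.0 / len(counts)
    majority = max(counts.values()) / len(y_true)
    return uniform, majority


# Hypothetical toy labels, NOT results from the paper.
y_true = ["angry", "happy", "sad", "neutral"] * 2
y_pred = ["sad", "happy", "angry", "angry", "neutral", "sad", "happy", "sad"]

acc = accuracy(y_true, y_pred)
uniform, majority = chance_levels(y_true)
below_chance = acc < uniform
print(f"accuracy={acc:.3f} uniform={uniform:.3f} majority={majority:.3f}")
print("below uniform chance:", below_chance)
```

Reporting both baselines matters when the target corpus is imbalanced, since a classifier can beat uniform guessing while still losing to a constant majority-class prediction.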
Abstract
The way that humans encode their emotion into speech signals is complex. For instance, an angry man may increase his pitch and speaking rate, and use impolite words. In this paper, we present a preliminary study on various emotional factors and investigate how each of them impacts modern emotion recognition systems. The key tool of our study is the SpeechFlow model presented recently, by which we are able to decompose speech signals into separate information factors (content, pitch, rhythm). Based on this decomposition, we carefully studied the performance of each information component and their combinations. We conducted the study on three different speech emotion corpora and chose an attention-based convolutional RNN as the emotion classifier. Our results show that rhythm is the most important component for emotional expression. Moreover, the cross-corpus results are very bad (even…
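The abstract mentions studying each information component and their combinations. As a rough sketch of what that experimental grid looks like, the snippet below enumerates every non-empty subset of the three named factors. The function name and the idea of looping over subsets are assumptions for illustration; the source does not say which combinations were actually evaluated.

```python
from itertools import combinations

# Factors named in the abstract.
FACTORS = ("content", "pitch", "rhythm")


def factor_subsets(factors=FACTORS):
    """Return every non-empty combination of factors, smallest first."""
    return [
        combo
        for size in range(1, len(factors) + 1)
        for combo in combinations(factors, size)
    ]


# One line per candidate input configuration for the classifier.
for subset in factor_subsets():
    print(" + ".join(subset))
```

With three factors this yields seven configurations: three single factors, three pairs, and the full set.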
Taxonomy
Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Speech and Audio Processing
