TL;DR
This paper introduces an attention-based fully convolutional network that effectively recognizes speech emotions by focusing on emotion-relevant regions, handling variable-length speech, and leveraging transfer learning to improve accuracy.
Contribution
The paper proposes a novel attention mechanism within a fully convolutional network for speech emotion recognition, utilizing transfer learning with pre-trained models to enhance performance.
Findings
Achieved 70.4% weighted accuracy on IEMOCAP
Outperformed state-of-the-art methods
Demonstrated effectiveness of attention mechanism and transfer learning
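The summary does not define "weighted accuracy". In speech emotion recognition work the term usually means overall accuracy across all test utterances, as opposed to unweighted accuracy, which averages per-class recall. Assuming that convention applies here, the two metrics can be sketched as follows (function names are illustrative, not from the paper):

```python
import numpy as np

def weighted_accuracy(y_true, y_pred):
    # Fraction of all utterances classified correctly; large classes dominate.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def unweighted_accuracy(y_true, y_pred):
    # Mean of per-class recall; every emotion class counts equally.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))
```

On an imbalanced test set the two numbers can differ noticeably, so it helps to know which one a reported figure refers to.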
Abstract
Speech emotion recognition is a challenging task for three main reasons: 1) human emotion is abstract, which makes it hard to distinguish; 2) in general, human emotion can only be detected at specific moments during a long utterance; 3) speech data with emotion labels is usually limited. In this paper, we present a novel attention-based fully convolutional network for speech emotion recognition. We employ a fully convolutional network because it can handle variable-length speech without segmentation, so critical information is not lost. The proposed attention mechanism makes the model aware of which time-frequency regions of the speech spectrogram are more emotion-relevant. Given the limited data, transfer learning is also adopted to improve accuracy. In particular, it is interesting to observe an obvious improvement obtained with natural scene image based…
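To make the idea of attention over time-frequency regions concrete, here is a minimal NumPy sketch of attention pooling over a convolutional feature map. It assumes a common additive-attention form (a tanh projection scored against a context vector, then a softmax). The parameter names `w` and `u` and the function name are hypothetical, and the paper's exact formulation may differ. Because the softmax runs over however many regions the feature map contains, inputs of different lengths all yield a fixed-size vector.

```python
import numpy as np

def attention_pool(features, w, u):
    """Pool a (freq, time, channels) feature map into one channel vector.

    features: conv output; the time dimension varies with utterance length.
    w: (channels, hidden) projection matrix (hypothetical parameter).
    u: (hidden,) context vector (hypothetical parameter).
    """
    n_freq, n_time, n_ch = features.shape
    regions = features.reshape(n_freq * n_time, n_ch)  # one row per region
    scores = np.tanh(regions @ w) @ u                  # relevance per region
    scores = scores - scores.max()                     # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()      # softmax over regions
    context = alpha @ regions                          # weighted sum, (channels,)
    return context, alpha.reshape(n_freq, n_time)

# Two utterances of different length produce the same output size.
rng = np.random.default_rng(0)
w, u = rng.normal(size=(8, 4)), rng.normal(size=4)
short_vec, _ = attention_pool(rng.normal(size=(3, 5, 8)), w, u)
long_vec, _ = attention_pool(rng.normal(size=(3, 17, 8)), w, u)
print(short_vec.shape, long_vec.shape)
```

The returned attention map has the same frequency and time layout as the feature map, so it can be inspected to see which regions received the most weight.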
