TL;DR
This paper introduces a multimodal deep learning approach for fine-grained speech emotion recognition. It combines a temporal alignment mean-max pooling mechanism with a cross modality excitement module to improve prediction accuracy on two real-world datasets.
Contribution
It proposes a model that combines a temporal alignment mean-max pooling mechanism with a cross modality excitement module for fine-grained emotion recognition from speech.
Findings
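Outperforms a wide range of baselines in prediction accuracy on two real-world speech emotion recognition datasets
Captures subtle, fine-grained emotions implied in each utterance
The temporal alignment mean-max pooling and cross modality excitement components both contribute to the results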
Abstract
Speech emotion recognition is a challenging task because the emotion expression is complex, multimodal and fine-grained. In this paper, we propose a novel multimodal deep learning approach to perform fine-grained emotion recognition from real-life speeches. We design a temporal alignment mean-max pooling mechanism to capture the subtle and fine-grained emotions implied in every utterance. In addition, we propose a cross modality excitement module to conduct sample-specific adjustment on cross modality embeddings and adaptively recalibrate the corresponding values by its aligned latent features from the other modality. Our proposed model is evaluated on two well-known real-world speech emotion recognition datasets. The results demonstrate that our approach is superior on the prediction tasks for multimodal speech utterances, and it outperforms a wide range of baselines in terms of…
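The two mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that "temporal alignment" means each text token comes with a hypothetical `[start, end)` span of audio frame indices, that mean-max pooling concatenates the per-dimension mean and max over those frames, and that the excitement step is a sigmoid gate computed from the other modality's aligned vector. The function names and the single `weights` matrix are illustrative only.

```python
import math


def mean_max_pool(frames):
    """Concatenate the per-dimension mean and max over a list of frame vectors."""
    dims = range(len(frames[0]))
    mean = [sum(f[d] for f in frames) / len(frames) for d in dims]
    peak = [max(f[d] for f in frames) for d in dims]
    return mean + peak


def aligned_pool(audio_frames, spans):
    """Pool the audio frames that fall inside each token's [start, end) span.

    `spans` is an assumed input: one (start, end) pair of frame indices per token.
    """
    return [mean_max_pool(audio_frames[start:end]) for start, end in spans]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cross_excite(target, source, weights):
    """Rescale `target` element-wise by a gate derived from the aligned `source`.

    `weights` has one row per target dimension and one column per source
    dimension; in a trained model it would be learned (assumption).
    """
    gates = [sigmoid(sum(w * x for w, x in zip(row, source))) for row in weights]
    return [t * g for t, g in zip(target, gates)]


# Three 2-dimensional audio frames aligned to two tokens.
audio = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]
spans = [(0, 2), (2, 3)]
pooled = aligned_pool(audio, spans)

# Gate a 2-dimensional text embedding with the first token's pooled audio vector.
text_embedding = [2.0, -4.0]
zero_weights = [[0.0] * 4, [0.0] * 4]  # all-zero weights give a gate of 0.5
recalibrated = cross_excite(text_embedding, pooled[0], zero_weights)
```

Because the gate depends on the aligned vector from the other modality, the rescaling differs per sample and per token, which is one way to read the abstract's "sample-specific adjustment". The same function can be applied in the opposite direction, gating audio features with text features.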
