Speech Emotion Recognition with Co-Attention based Multi-level Acoustic Information
Heqing Zou, Yuke Si, Chen Chen, Deepu Rajan, Eng Siong Chng

TL;DR
This paper introduces an end-to-end speech emotion recognition system that leverages multi-level acoustic features and a co-attention mechanism to improve emotion detection accuracy from audio data.
Contribution
It proposes a novel co-attention based fusion method for multi-level acoustic features in speech emotion recognition.
Findings
Achieved competitive results on the IEMOCAP dataset.
Effectively fused multi-level acoustic features using co-attention.
Demonstrated robustness across different cross-validation strategies.
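To make the idea of speaker-independent cross-validation concrete, here is a minimal sketch of how such folds can be built. The summary does not name the two strategies, so the leave-one-session-out and leave-one-speaker-out splits below are an assumption (both are common with IEMOCAP, which has 5 sessions with 2 speakers each). The utterance records, field names, and the `leave_one_group_out` helper are hypothetical and are not taken from the authors' code.

```python
from collections import defaultdict


def leave_one_group_out(utterances, key):
    """Yield (held_out, train, test) folds, holding out one group per fold.

    `key` names the field to group by, e.g. "session" or "speaker".
    """
    groups = defaultdict(list)
    for utt in utterances:
        groups[utt[key]].append(utt)
    for held_out in sorted(groups):
        test = groups[held_out]
        train = [u for g, items in groups.items() if g != held_out for u in items]
        yield held_out, train, test


# Toy metadata (made up): one utterance per speaker, 5 sessions x 2 speakers.
utterances = [
    {"id": f"Ses0{s}{g}_utt", "session": s, "speaker": f"Ses0{s}{g}"}
    for s in range(1, 6)
    for g in "FM"
]

session_folds = list(leave_one_group_out(utterances, "session"))  # 5 folds
speaker_folds = list(leave_one_group_out(utterances, "speaker"))  # 10 folds
```

In both cases no speaker in a test fold also appears in the corresponding training fold, which is what makes the evaluation speaker-independent.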
Abstract
Speech Emotion Recognition (SER) aims to help machines understand human subjective emotion from audio information alone. However, extracting and utilizing comprehensive, in-depth audio information remains a challenging task. In this paper, we propose an end-to-end speech emotion recognition system that uses multi-level acoustic information with a newly designed co-attention module. We first extract multi-level acoustic information, including MFCC, spectrogram, and embedded high-level acoustic information, with CNN, BiLSTM, and wav2vec2, respectively. These extracted features are then treated as multimodal inputs and fused by the proposed co-attention mechanism. Experiments are carried out on the IEMOCAP dataset, and our model achieves competitive performance under two different speaker-independent cross-validation strategies. Our code is available on GitHub.
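The fusion step described in the abstract can be illustrated with a small sketch. The abstract does not specify the internals of the co-attention module, so what follows is a generic scaled dot-product attention used as a stand-in, not the authors' design: summary vectors from the MFCC and spectrogram branches form a query that weights the frame-level wav2vec2 embeddings. All function names, the choice of summing the two low-level vectors into one query, and the toy numbers are assumptions. Plain lists are used so the example runs without dependencies.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, frames):
    """Pool `frames` (T vectors of size d) using one query vector of size d."""
    d = len(query)
    scores = [
        sum(q * f for q, f in zip(query, frame)) / math.sqrt(d)
        for frame in frames
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * frame[i] for w, frame in zip(weights, frames))
        for i in range(d)
    ]
    return pooled, weights


def co_attention_fuse(mfcc_vec, spec_vec, w2v_frames):
    """Hypothetical fusion: low-level features decide which frames matter."""
    # Assumption: all three inputs were already projected to the same size d.
    query = [a + b for a, b in zip(mfcc_vec, spec_vec)]
    pooled, weights = cross_attend(query, w2v_frames)
    # List concatenation: the fused vector has size 3 * d.
    return mfcc_vec + spec_vec + pooled, weights


# Toy inputs (made up), with d = 2 and T = 3 frames.
mfcc_vec = [1.0, 0.0]
spec_vec = [0.0, 1.0]
w2v_frames = [[1.0, 1.0], [0.0, 0.0], [-1.0, -1.0]]
fused, weights = co_attention_fuse(mfcc_vec, spec_vec, w2v_frames)
```

In a real system the fused vector would feed a classifier over the emotion classes, and the projections and attention parameters would be learned jointly with the CNN, BiLSTM, and wav2vec2 branches.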
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Emotion and Mood Recognition
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Bidirectional LSTM
