Speech Emotion Recognition with Co-Attention based Multi-level Acoustic Information

Heqing Zou; Yuke Si; Chen Chen; Deepu Rajan; Eng Siong Chng

arXiv:2203.15326 · cs.SD · March 30, 2022

TL;DR

This paper introduces an end-to-end speech emotion recognition system that leverages multi-level acoustic features and a co-attention mechanism to improve emotion detection accuracy from audio data.

Contribution

It proposes a novel co-attention based fusion method for multi-level acoustic features in speech emotion recognition.

Findings

01. Achieved competitive results on the IEMOCAP dataset.

02. Effectively fused multi-level acoustic features using co-attention.

03. Demonstrated robustness across different cross-validation strategies.

Abstract

Speech Emotion Recognition (SER) aims to help machines understand a speaker's subjective emotion from audio information alone. However, extracting and utilizing comprehensive, in-depth audio information remains a challenging task. In this paper, we propose an end-to-end speech emotion recognition system that uses multi-level acoustic information with a newly designed co-attention module. We first extract multi-level acoustic information, including MFCC, spectrogram, and embedded high-level acoustic information, with CNN, BiLSTM and wav2vec2, respectively. These extracted features are then treated as multimodal inputs and fused by the proposed co-attention mechanism. Experiments are carried out on the IEMOCAP dataset, and our model achieves competitive performance under two different speaker-independent cross-validation strategies. Our code is available on GitHub.
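The fusion idea in the abstract can be illustrated with a small, dependency-free sketch. This is not the paper's co-attention module: it is a simplified attention-weighted fusion, written under the assumption that two utterance-level vectors (standing in for the MFCC and spectrogram branches) guide a softmax weighting over frame-level embeddings (standing in for wav2vec2 outputs), after which all three are concatenated. The function names `attend` and `fuse`, the summed guide vector, and the toy dimensions are all hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(guide, frames):
    """Pool frame vectors with scaled dot-product attention against a guide vector."""
    dim = len(guide)
    scores = [dot(guide, frame) / math.sqrt(dim) for frame in frames]
    weights = softmax(scores)
    pooled = [sum(w * frame[i] for w, frame in zip(weights, frames))
              for i in range(dim)]
    return pooled, weights


def fuse(mfcc_vec, spec_vec, embed_frames):
    """Assumed fusion: the two low-level vectors jointly guide attention over
    the frame embeddings, then all three representations are concatenated."""
    guide = [a + b for a, b in zip(mfcc_vec, spec_vec)]  # hypothetical guide
    pooled, weights = attend(guide, embed_frames)
    return mfcc_vec + spec_vec + pooled, weights


# Toy inputs: 2-dimensional features and three frames of embeddings.
mfcc_vec = [1.0, 0.0]
spec_vec = [0.0, 1.0]
embed_frames = [[1.0, 1.0], [-1.0, -1.0], [0.5, 0.5]]
fused, weights = fuse(mfcc_vec, spec_vec, embed_frames)
```

In this toy setup the frame most aligned with the guide vector receives the largest weight, and the fused vector has the combined length of the three inputs. A real system would learn projection matrices for each branch and feed the fused vector to a classifier; those parts are omitted here.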

Code & Models

Repositories

vincent-zhq/ca-mser (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Emotion and Mood Recognition

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Bidirectional LSTM