TL;DR
This paper introduces an efficient multi-scale CNN and attention-based neural network architecture that effectively combines acoustic and lexical features for speech emotion recognition, outperforming previous methods on the IEMOCAP dataset.
Contribution
The paper proposes a novel multi-scale CNN architecture with attention mechanisms to integrate acoustic and lexical features for improved speech emotion recognition.
Findings
Outperforms previous state-of-the-art methods on IEMOCAP with a 5% accuracy improvement
Effective fusion of audio and text features using MSCNN and attention modules
Achieves higher weighted and unweighted accuracy in emotion classification
Abstract
Emotion recognition from speech is a challenging task. Recent advances in deep learning have established bi-directional recurrent neural networks (Bi-RNN) with attention mechanisms as a standard method for speech emotion recognition: multi-modal features (audio and text) are extracted and attended to, then fused for downstream emotion classification. In this paper, we propose a simple yet efficient neural network architecture to exploit both acoustic and lexical information from speech. The proposed framework uses multi-scale convolutional layers (MSCNN) to obtain both audio and text hidden representations. Then, a statistical pooling unit (SPU) is used to further extract the features in each modality. In addition, an attention module can be built on top of the MSCNN-SPU (audio) and MSCNN (text) to further improve the performance. Extensive experiments show that the proposed…
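The abstract names two building blocks, multi-scale convolution and a statistical pooling unit, without giving their details here. The following is a minimal sketch of the general idea in plain Python, not the authors' implementation. It assumes one scalar feature per frame, fixed hand-picked kernels in place of learned ones, and a pooling unit that concatenates mean, max, and standard deviation over time; the function names and the choice of statistics are illustrative assumptions.

```python
from statistics import mean, pstdev


def conv1d_valid(seq, kernel):
    """Slide `kernel` over `seq` with stride 1 and no padding."""
    k = len(kernel)
    return [
        sum(seq[i + j] * kernel[j] for j in range(k))
        for i in range(len(seq) - k + 1)
    ]


def statistical_pool(feature_map):
    """Collapse a variable-length feature map into fixed-size statistics.

    Assumption: mean, max, and population std; the abstract does not
    say which statistics the SPU actually uses.
    """
    return [mean(feature_map), max(feature_map), pstdev(feature_map)]


def multi_scale_features(seq, kernels):
    """Apply one convolution per kernel width, pool each, and concatenate."""
    pooled = []
    for kernel in kernels:
        pooled.extend(statistical_pool(conv1d_valid(seq, kernel)))
    return pooled


# Hypothetical toy input: one scalar acoustic feature per frame.
frames = [1.0, 2.0, 3.0, 4.0, 5.0]
# Hypothetical fixed kernels of widths 2 and 3 (learned in a real model).
kernels = [[0.5, 0.5], [1.0, 0.0, -1.0]]
features = multi_scale_features(frames, kernels)
```

Because pooling runs over the time axis, the output length depends only on the number of kernel scales (three statistics per scale), not on the utterance length. That property is what allows variable-length audio or text to feed a fixed-size classifier.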
