Attention Driven Fusion for Multi-Modal Emotion Recognition
Darshana Priyasad, Tharindu Fernando, Simon Denman, Clinton Fookes, Sridha Sridharan

TL;DR
This paper introduces a deep learning approach that fuses acoustic and text data using attention mechanisms and specialized feature extraction layers to improve emotion recognition accuracy on the IEMOCAP dataset.
Contribution
It proposes a novel multi-modal fusion method with SincNet for acoustic features and cross attention for text, enhancing emotion classification performance.
Findings
Achieved 3.5% improvement in weighted accuracy over state-of-the-art methods.
Utilized SincNet for more effective acoustic feature extraction.
Introduced cross attention to model N-gram level correlations in text.
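To make the cross-attention finding concrete, here is a minimal sketch of generic scaled dot-product cross attention between two feature sequences. This summary does not give the paper's exact formulation, so the function names, the shapes, and the use of unprojected features as keys and values are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Generic scaled dot-product cross attention (illustrative only).

    queries:     (n_q, d) features from one sequence, e.g. one set of n-gram features
    keys_values: (n_k, d) features from another sequence, used as both keys and values
    Returns the attended features (n_q, d) and the attention weights (n_q, n_k).
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # similarity of every query to every key
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ keys_values, weights

# Hypothetical stand-ins for two sets of text features (random, for shape checking only).
rng = np.random.default_rng(0)
a = rng.standard_normal((5, 8))
b = rng.standard_normal((7, 8))
attended, weights = cross_attention(a, b)
```

Each row of `weights` indicates how strongly one position in the first sequence attends to every position in the second, which is the general mechanism by which correlations between the two feature sets can be modeled.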
Abstract
Deep learning has emerged as a powerful alternative to hand-crafted methods for emotion recognition on combined acoustic and text modalities. Baseline systems model emotion information in text and acoustic modes independently using Deep Convolutional Neural Networks (DCNN) and Recurrent Neural Networks (RNN), followed by applying attention, fusion, and classification. In this paper, we present a deep learning-based approach to exploit and fuse text and acoustic data for emotion classification. We utilize a SincNet layer, based on parameterized sinc functions with band-pass filters, to extract acoustic features from raw audio followed by a DCNN. This approach learns filter banks tuned for emotion recognition and provides more effective features compared to directly applying convolutions over the raw speech signal. For text processing, we use two branches (a DCNN and a Bi-direction RNN…
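The parameterized sinc band-pass filter behind a SincNet-style layer can be sketched as the difference of two ideal low-pass sinc filters, smoothed with a window. The version below is a fixed, non-learned NumPy illustration: in a SincNet layer the two cutoff frequencies would be trainable parameters, and the kernel size, sample rate, and Hamming window used here are assumed values, not settings taken from the paper.

```python
import numpy as np

def sinc_bandpass(f_low, f_high, kernel_size=101, sample_rate=16000):
    """Windowed band-pass FIR kernel defined only by its two cutoff frequencies (Hz)."""
    # Symmetric time axis in samples, centred on zero.
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    # Cutoffs normalised to cycles per sample.
    f1 = f_low / sample_rate
    f2 = f_high / sample_rate
    # np.sinc(x) = sin(pi*x) / (pi*x), so 2*f*sinc(2*f*n) is an ideal low-pass
    # filter with cutoff f; subtracting two of them leaves the band [f1, f2].
    kernel = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    # The window reduces ripple caused by truncating the infinite sinc.
    return kernel * np.hamming(kernel_size)

# Illustrative band of 300-3000 Hz applied to a random stand-in for raw audio.
kernel = sinc_bandpass(300.0, 3000.0)
signal = np.random.default_rng(0).standard_normal(16000)
filtered = np.convolve(signal, kernel, mode="same")

# Magnitude response at 1 Hz resolution (zero-padded FFT of length sample_rate).
response = np.abs(np.fft.rfft(kernel, 16000))
```

Because each filter is described by only two numbers (its low and high cutoffs), a bank of such filters has far fewer parameters than a bank of unconstrained convolution kernels of the same length.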
Taxonomy
Methods: Diffusion-Convolutional Neural Networks
