Information Fusion in Attention Networks Using Adaptive and Multi-level Factorized Bilinear Pooling for Audio-visual Emotion Recognition

Hengshun Zhou; Jun Du; Yuanyuan Zhang; Qing Wang; Qing-Feng Liu; Chin-Hui Lee

arXiv:2111.08910·cs.SD·November 18, 2021

TL;DR

This paper introduces a novel multimodal fusion attention network utilizing adaptive and multi-level factorized bilinear pooling for improved audio-visual emotion recognition, achieving state-of-the-art results on multiple datasets.

Contribution

The paper proposes a fusion network built on adaptive and multi-level factorized bilinear pooling, which integrates audio and visual features for emotion recognition.

Findings

01 Achieved 71.40% accuracy with the FCN method on speech emotion recognition.

02 Attained 63.09% and 75.49% accuracy on the AFEW and IEMOCAP datasets, respectively.

03 Outperformed existing methods in multimodal emotion recognition tasks.

Abstract

Multimodal emotion recognition is a challenging task in emotion computing because it is difficult to extract discriminative features that capture the subtle differences among human emotions, which are abstract concepts with multiple expressions. Moreover, how to fully utilize both audio and visual information is still an open problem. In this paper, we propose a novel multimodal fusion attention network for audio-visual emotion recognition based on adaptive and multi-level factorized bilinear pooling (FBP). First, for the audio stream, a fully convolutional network (FCN) equipped with a 1-D attention mechanism and local response normalization is designed for speech emotion recognition. Next, a global FBP (G-FBP) approach is presented to perform audio-visual information fusion by integrating a self-attention based video stream with the proposed audio stream. To improve G-FBP, an adaptive strategy…
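The abstract names factorized bilinear pooling but does not spell out the operation. The sketch below illustrates one commonly used formulation, which is assumed here and not confirmed by this page: each modality is projected by its own matrix, the projections are multiplied element-wise, groups of k values are sum-pooled, and the result is power-normalized and L2-normalized. The function name, the matrices U and V, and the factor k are hypothetical names chosen for illustration; the adaptive and multi-level variants from the paper are not shown.

```python
import math


def factorized_bilinear_pool(audio, video, U, V, k):
    """Fuse two feature vectors with a basic factorized bilinear pooling step.

    Assumed shapes (illustrative, not taken from the paper):
      audio: list of length m, video: list of length n
      U: m rows x (k * o) columns, V: n rows x (k * o) columns
    Returns a fused vector of length o.
    """
    width = len(U[0])
    if width % k != 0 or len(V[0]) != width:
        raise ValueError("U and V need the same column count, divisible by k")

    # Project each modality into the shared (k * o)-dimensional space.
    proj_a = [sum(audio[i] * U[i][j] for i in range(len(audio))) for j in range(width)]
    proj_v = [sum(video[i] * V[i][j] for i in range(len(video))) for j in range(width)]

    # The element-wise product models pairwise audio-visual interactions.
    joint = [a * v for a, v in zip(proj_a, proj_v)]

    # Sum-pool non-overlapping windows of size k, giving o outputs.
    pooled = [sum(joint[j:j + k]) for j in range(0, width, k)]

    # Power normalization (signed square root), then L2 normalization.
    powered = [math.copysign(math.sqrt(abs(z)), z) for z in pooled]
    norm = math.sqrt(sum(z * z for z in powered))
    return [z / norm for z in powered] if norm > 0 else powered


# Toy usage: 2-dim audio feature, 3-dim video feature, k = 2, o = 2.
U = [[1, 0, 1, 0], [0, 1, 0, 1]]
V = [[1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]]
fused = factorized_bilinear_pool([1, 2], [3, 0, 0], U, V, k=2)
```

In a trained network U and V would be learned parameters and the inputs would be the attention-pooled audio and video embeddings; the fixed matrices above only make the arithmetic easy to follow.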


Taxonomy

Topics: Emotion and Mood Recognition · Speech and Audio Processing · Music and Audio Processing