Information Fusion in Attention Networks Using Adaptive and Multi-level Factorized Bilinear Pooling for Audio-visual Emotion Recognition
Hengshun Zhou, Jun Du, Yuanyuan Zhang, Qing Wang, Qing-Feng Liu, and Chin-Hui Lee

TL;DR
This paper introduces a novel multimodal fusion attention network utilizing adaptive and multi-level factorized bilinear pooling for improved audio-visual emotion recognition, achieving state-of-the-art results on multiple datasets.
Contribution
It proposes a new fusion network with adaptive and multi-level bilinear pooling, enhancing the integration of audio and visual features for emotion recognition.
Findings
Achieved 71.40% accuracy with the FCN method on speech emotion recognition.
Attained 63.09% and 75.49% accuracy on the AFEW and IEMOCAP datasets, respectively.
Outperformed existing methods in multimodal emotion recognition tasks.
Abstract
Multimodal emotion recognition is a challenging task in emotion computing because it is difficult to extract discriminative features that capture the subtle differences among human emotions, which are abstract concepts with multiple expressions. Moreover, how to fully utilize both audio and visual information is still an open problem. In this paper, we propose a novel multimodal fusion attention network for audio-visual emotion recognition based on adaptive and multi-level factorized bilinear pooling (FBP). First, for the audio stream, a fully convolutional network (FCN) equipped with a 1-D attention mechanism and local response normalization is designed for speech emotion recognition. Next, a global FBP (G-FBP) approach is presented to perform audio-visual information fusion by integrating a self-attention-based video stream with the proposed audio stream. To improve G-FBP, an adaptive strategy…
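For readers unfamiliar with factorized bilinear pooling, the fusion step named in the abstract can be illustrated with a minimal sketch. It follows the commonly used formulation (project both feature vectors, multiply element-wise, sum-pool over windows of size k, then apply signed square-root and L2 normalization) and is not taken from the paper itself. The function names, dimensions, and random projection matrices are assumptions for illustration only, and the adaptive and multi-level variants are not reproduced because the abstract is truncated here.

```python
import math
import random


def matvec_t(matrix, vec):
    """Return matrix^T @ vec for a list-of-lists matrix with len(vec) rows."""
    cols = len(matrix[0])
    return [sum(matrix[i][j] * vec[i] for i in range(len(vec))) for j in range(cols)]


def factorized_bilinear_pool(audio, video, U, V, k):
    """Fuse two feature vectors with factorized bilinear pooling (illustrative).

    U has shape (len(audio), k * o) and V has shape (len(video), k * o),
    where o is the fused output size and k is the factor dimension.
    In a real model U and V would be learned; here they are plain inputs.
    """
    pa = matvec_t(U, audio)                    # project audio to k * o dims
    pv = matvec_t(V, video)                    # project video to k * o dims
    joint = [a * v for a, v in zip(pa, pv)]    # element-wise product
    # Sum-pool over non-overlapping windows of size k, giving o values.
    pooled = [sum(joint[i:i + k]) for i in range(0, len(joint), k)]
    # Signed square-root (power) normalization.
    pooled = [math.copysign(math.sqrt(abs(x)), x) for x in pooled]
    # L2 normalization; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned projection weights."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes chosen only for the example.
rng = random.Random(0)
audio_dim, video_dim, k, o = 6, 8, 3, 4
U = random_matrix(audio_dim, k * o, rng)
V = random_matrix(video_dim, k * o, rng)
audio_vec = [rng.uniform(-1, 1) for _ in range(audio_dim)]
video_vec = [rng.uniform(-1, 1) for _ in range(video_dim)]
fused = factorized_bilinear_pool(audio_vec, video_vec, U, V, k)
```

The fused vector has o entries regardless of the two input sizes, which is what lets audio and video features of different dimensionality be combined before a classifier.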
Taxonomy
Topics: Emotion and Mood Recognition · Speech and Audio Processing · Music and Audio Processing
