Transformer-based Multimodal Information Fusion for Facial Expression Analysis
Wei Zhang, Feng Qiu, Suzhen Wang, Hao Zeng, Zhimeng Zhang, Rudong An, Bowen Ma, Yu Ding

TL;DR
This paper presents a transformer-based multimodal framework for facial expression analysis that integrates visual, audio, and text features, and it achieved top performance in the ABAW3 competition.
Contribution
It introduces a novel transformer-based fusion module for multimodal affective behavior analysis, combining static and dynamic features for improved AU detection and expression recognition.
Findings
Ranked first in the expression and AU tracks of the ABAW3 competition.
Demonstrated effectiveness through extensive evaluations and ablation studies.
Improved model performance with data balancing and augmentation techniques.
Abstract
Human affective behavior analysis has received much attention in human-computer interaction (HCI). In this paper, we introduce our submission to the CVPR 2022 Competition on Affective Behavior Analysis in-the-wild (ABAW). To fully exploit affective knowledge from multiple views, we utilize the multimodal features of spoken words, speech prosody, and facial expression, which are extracted from the video clips in the Aff-Wild2 dataset. Based on these features, we propose a unified transformer-based multimodal framework for Action Unit detection and also expression recognition. Specifically, the static vision feature is first encoded from the current frame image. At the same time, we clip its adjacent frames by a sliding window and extract three kinds of multimodal features from the sequence of images, audio, and text. Then, we introduce a transformer-based fusion module that integrates…
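The sliding-window step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `window_indices`, the `half_width` and `stride` parameters, and the choice to repeat edge frames at clip boundaries are all assumptions made for illustration. It only shows how a window of adjacent frame indices might be selected around the current frame before sequence features are extracted.

```python
from typing import List


def window_indices(center: int, num_frames: int, half_width: int, stride: int = 1) -> List[int]:
    """Return frame indices for a window centred on `center`.

    Hypothetical helper: the window holds 2 * half_width + 1 indices spaced
    `stride` frames apart. Indices that fall outside the clip are clamped,
    so edge frames are repeated rather than dropped (an assumed padding rule).
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if not 0 <= center < num_frames:
        raise ValueError("center must index a frame in the clip")
    if half_width < 0 or stride < 1:
        raise ValueError("half_width must be >= 0 and stride must be >= 1")

    offsets = range(-half_width * stride, half_width * stride + 1, stride)
    # Clamp each index into [0, num_frames - 1] to keep the window length fixed.
    return [min(max(center + off, 0), num_frames - 1) for off in offsets]


if __name__ == "__main__":
    # Window of 5 frames around frame 5 in a 100-frame clip.
    print(window_indices(5, 100, half_width=2))
    # Near the start of the clip the first frame is repeated.
    print(window_indices(0, 10, half_width=2))
```

Keeping the window length fixed means every current frame yields a sequence of the same size, which is convenient when the sequence features are later batched for a fusion module.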
Taxonomy
Topics: Emotion and Mood Recognition
Methods: Softmax · Concatenated Skip Connection · Contrastive Language-Image Pre-training
