Transformer-based Multimodal Information Fusion for Facial Expression Analysis

Wei Zhang; Feng Qiu; Suzhen Wang; Hao Zeng; Zhimeng Zhang; Rudong An; Bowen Ma; Yu Ding

arXiv:2203.12367·cs.CV·April 19, 2022·1 cites

Open Access

TL;DR

This paper presents a transformer-based multimodal framework for facial expression analysis that integrates visual, audio, and text features and achieved top performance in the ABAW3 competition.

Contribution

It introduces a novel transformer-based fusion module for multimodal affective behavior analysis, combining static and dynamic features for improved AU detection and expression recognition.

Findings

01. Ranked first in the ABAW3 competition on the expression and AU tracks.

02. Demonstrated effectiveness through extensive evaluations and ablation studies.

03. Improved model performance with data balancing and augmentation techniques.

Abstract

Human affective behavior analysis has received much attention in human-computer interaction (HCI). In this paper, we introduce our submission to the CVPR 2022 Competition on Affective Behavior Analysis in-the-wild (ABAW). To fully exploit affective knowledge from multiple views, we utilize the multimodal features of spoken words, speech prosody, and facial expression, which are extracted from the video clips in the Aff-Wild2 dataset. Based on these features, we propose a unified transformer-based multimodal framework for Action Unit (AU) detection and expression recognition. Specifically, the static vision feature is first encoded from the current frame image. At the same time, we clip its adjacent frames with a sliding window and extract three kinds of multimodal features from the sequence of images, audio, and text. Then, we introduce a transformer-based fusion module that integrates…
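
The sliding-window clipping mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `sliding_window_indices`, the symmetric window, the `half_width` parameter, and the choice to clamp indices at the clip boundaries are all assumptions made for illustration, since the abstract does not state the window size or how edges are handled.

```python
from typing import List


def sliding_window_indices(center: int, num_frames: int, half_width: int) -> List[int]:
    """Return the frame indices of a window centred on `center`.

    Hypothetical helper: the window spans `half_width` frames on each side,
    and out-of-range indices are clamped to the first or last frame of the
    clip (an assumed boundary policy, not one taken from the paper).
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if not 0 <= center < num_frames:
        raise ValueError("center must be a valid frame index")
    if half_width < 0:
        raise ValueError("half_width must be non-negative")

    last = num_frames - 1
    return [
        min(max(i, 0), last)  # clamp to [0, last]
        for i in range(center - half_width, center + half_width + 1)
    ]


if __name__ == "__main__":
    # Window around an interior frame, then around the first frame of the clip.
    print(sliding_window_indices(center=5, num_frames=10, half_width=2))
    print(sliding_window_indices(center=0, num_frames=10, half_width=2))
```

Under these assumptions, every frame gets a window of the same length, `2 * half_width + 1`, so the image, audio, and text features extracted from the window could be batched without padding logic.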

Taxonomy

Topics: Emotion and Mood Recognition

Methods: Softmax · Concatenated Skip Connection · Contrastive Language-Image Pre-training