Multi Modal Facial Expression Recognition with Transformer-Based Fusion Networks and Dynamic Sampling

Jun-Hwa Kim; Namho Kim; Chee Sun Won

arXiv:2303.08419 · cs.CV · March 21, 2023 · 5 cites

TL;DR

This paper presents a multi-modal facial expression recognition method that combines audio with facial images through a transformer-based fusion module, and that uses dynamic data resampling to counter class imbalance in the training set.

Contribution

It introduces a Modal Fusion Module (MFM) that fuses image and audio features extracted with Swin Transformer, and it addresses dataset imbalance through dynamic data resampling.

Findings

01. Effective fusion of audio and visual features improves recognition accuracy.

02. Dynamic sampling mitigates dataset imbalance issues.

03. Model performs well in ABAW challenge evaluations.
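
The idea behind finding 02 can be illustrated with a small sketch. This page does not say how the paper's dynamic sampling works, so the scheme below is an assumption: at the start of every epoch, training indices are redrawn with weights inversely proportional to class frequency, so rare expressions appear about as often as common ones. The function name `resample_epoch` and the toy labels are hypothetical.

```python
import random
from collections import Counter


def resample_epoch(labels, num_samples, rng):
    """Draw sample indices so each class is picked with roughly equal probability.

    Assumed scheme (not taken from the paper): inverse-frequency weighting.
    """
    counts = Counter(labels)
    # Each sample of a rare class gets a larger weight than one of a common class.
    weights = [1.0 / counts[y] for y in labels]
    return rng.choices(range(len(labels)), weights=weights, k=num_samples)


# Toy imbalanced label set: 90% "neutral", 10% "fear".
labels = ["neutral"] * 90 + ["fear"] * 10
rng = random.Random(0)

# "Dynamic" here means the index set is redrawn for every epoch.
for epoch in range(3):
    idx = resample_epoch(labels, num_samples=1000, rng=rng)
    drawn = Counter(labels[i] for i in idx)
    print(epoch, dict(drawn))
```

Redrawing each epoch means the network sees different members of the majority class over time rather than one fixed subsample.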

Abstract

Facial expression recognition is an essential task for various applications, including emotion detection, mental health analysis, and human-machine interactions. In this paper, we propose a multi-modal facial expression recognition method that exploits audio information along with facial images to provide a crucial clue to differentiate some ambiguous facial expressions. Specifically, we introduce a Modal Fusion Module (MFM) to fuse audio-visual information, where image and audio features are extracted from Swin Transformer. Additionally, we tackle the imbalance problem in the dataset by employing dynamic data resampling. Our model has been evaluated in the Affective Behavior in-the-wild (ABAW) challenge of CVPR 2023.
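
The abstract does not describe the internals of the MFM. As a rough illustration of what fusing two feature sequences can look like, the sketch below applies generic single-head cross-attention, with visual tokens as queries and audio tokens as keys and values, followed by a residual connection. It has no learned projections and is not the paper's module; `cross_attend` and the toy feature vectors are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Fuse two token sequences with scaled dot-product cross-attention.

    Each query token is combined with a softmax-weighted average of the
    other modality's tokens (used as both keys and values here).
    """
    dim = len(queries[0])
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        attended = [sum(w * kv[j] for w, kv in zip(weights, keys_values))
                    for j in range(dim)]
        # Residual connection: original feature plus attended feature.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused


# Toy features: 2 visual tokens and 3 audio tokens, each of dimension 2.
visual = [[1.0, 0.0], [0.0, 1.0]]
audio = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
fused = cross_attend(visual, audio)
print(fused)
```

The output has one fused vector per visual token, so a downstream classifier would keep the visual sequence length while each token also carries audio context.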

Taxonomy

Topics: Emotion and Mood Recognition · Speech and Audio Processing

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Stochastic Depth · Dense Connections · Label Smoothing