Multimodal Channel-Mixing: Channel and Spatial Masked AutoEncoder on Facial Action Unit Detection

Xiang Zhang; Huiyuan Yang; Taoyue Wang; Xiaotian Li; Lijun Yin

arXiv:2209.12244·cs.CV·August 24, 2023

TL;DR

This paper introduces Multimodal Channel-Mixing (MCM), an early-fusion masked autoencoder for facial Action Unit (AU) detection. It is used as a pre-trained model to improve multi-modal feature learning and is reported to outperform state-of-the-art baseline methods.

Contribution

The paper proposes a new multi-modal reconstruction network with channel-mixing and masked autoencoding for robust AU detection, emphasizing early fusion and multi-modal learning.

Findings

01 Outperforms state-of-the-art baseline methods
02 Effective in learning robust multi-modal representations
03 Reduces channel redundancy and enhances fusion capabilities

Abstract

Recent studies have focused on utilizing multi-modal data to develop robust models for facial Action Unit (AU) detection. However, the heterogeneity of multi-modal data poses challenges in learning effective representations. One such challenge is extracting relevant features from multiple modalities using a single feature extractor. Moreover, previous studies have not fully explored the potential of multi-modal fusion strategies. In contrast to the extensive work on late fusion, there are limited investigations on early fusion for channel information exploration. This paper presents a novel multi-modal reconstruction network, named Multimodal Channel-Mixing (MCM), as a pre-trained model to learn robust representations that facilitate multi-modal fusion. The approach follows an early fusion setup, integrating a Channel-Mixing module, where two out of five channels are randomly dropped.…
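
The channel-dropping step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the early-fused input is a single array of shape (5, H, W), and it interprets "dropped" as setting a channel to zero; the abstract specifies neither. The function name `drop_channels` and the toy input are hypothetical.

```python
import numpy as np


def drop_channels(x, n_drop=2, rng=None):
    """Zero out `n_drop` randomly chosen channels of a (C, H, W) array.

    Assumption: "dropping" a channel means replacing it with zeros.
    Returns the masked copy and the sorted indices of the dropped channels.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_channels = x.shape[0]
    # Sample distinct channel indices without replacement.
    dropped = rng.choice(n_channels, size=n_drop, replace=False)
    mixed = x.copy()
    mixed[dropped] = 0.0
    return mixed, np.sort(dropped)


# Toy early-fusion input: five channels stacked along the first axis.
# Values are shifted to be strictly positive so zeroed channels stand out.
rng = np.random.default_rng(0)
fused = rng.random((5, 8, 8)) + 1.0
mixed, dropped = drop_channels(fused, n_drop=2, rng=rng)
```

In a masked-autoencoder setup, a reconstruction network would then be trained to recover `fused` from `mixed`, which is what pushes the encoder to infer the missing channels from the remaining ones.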

Taxonomy

Topics: Emotion and Mood Recognition · Speech and Audio Processing · Anomaly Detection Techniques and Applications