Multi-Modal Learning for AU Detection Based on Multi-Head Fused Transformers

Xiang Zhang; Lijun Yin

arXiv:2203.11441·cs.CV·March 23, 2022

TL;DR

This paper introduces a multi-head fused transformer model for facial action unit (AU) detection. The model learns features from multiple modalities, fuses them with a dedicated transformer module, and is reported to outperform state-of-the-art methods on the BP4D and BP4D+ datasets.

Contribution

The paper proposes an end-to-end multi-head fused transformer architecture for AU detection, integrating multi-modal feature learning and fusion with attention mechanisms, which is a novel approach in this domain.

Findings

01. Outperforms state-of-the-art methods on BP4D and BP4D+ datasets.

02. Effective multi-modal feature learning and fusion demonstrated.

03. Analyzes modality contributions to AU detection performance.

Abstract

Multi-modal learning has intensified in recent years, especially for applications in facial analysis and action unit (AU) detection, but two main challenges remain: 1) learning relevant features for representation and 2) efficiently fusing multiple modalities. A number of recent works have shown the effectiveness of the attention mechanism for AU detection; however, most of them bind the region of interest (ROI) to features and rarely apply attention between the features of each AU. On the other hand, the transformer, which utilizes a more efficient self-attention mechanism, has been widely used in natural language processing and computer vision tasks but has not been fully explored for AU detection. In this paper, we propose a novel end-to-end Multi-Head Fused Transformer (MFT) method for AU detection, which learns AU encoding features…
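
The abstract is truncated here, so the details of the MFT fusion module are not available on this page. As a rough illustration of the general idea of attention-based fusion across modalities, the sketch below applies plain scaled dot-product attention, with features from one modality acting as queries and features from a second modality acting as keys and values. It is not the authors' architecture: the function names, the two generic modalities, and the toy feature values are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention.

    Each query row attends over all key rows and returns a weighted
    average of the corresponding value rows.
    """
    dim = len(keys[0])
    fused = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of value rows, one output dimension at a time.
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


# Hypothetical toy features: two tokens per modality, three dimensions each.
modality_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
modality_b = [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]

# Modality A queries modality B; each output row mixes modality B's features.
fused_tokens = cross_attention(modality_a, modality_b, modality_b)
```

Each row of `fused_tokens` is a convex combination of the rows of `modality_b`, weighted by how strongly the corresponding `modality_a` token matches each `modality_b` token. A multi-head variant would repeat this over several learned projections of the inputs and concatenate the results.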

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Residual Connection · Position-Wise Feed-Forward Layer · Dense Connections · Softmax · Label Smoothing · Dropout