Multi-Modal Learning for AU Detection Based on Multi-Head Fused Transformers
Xiang Zhang, Lijun Yin

TL;DR
This paper introduces a multi-head fused transformer (MFT) for facial action unit (AU) detection. The model learns features from multiple modalities and fuses them with a dedicated transformer module, outperforming state-of-the-art methods on the BP4D and BP4D+ datasets.
Contribution
The paper proposes an end-to-end multi-head fused transformer architecture for AU detection that combines multi-modal feature learning with attention-based fusion, an approach the authors present as novel for this domain.
Findings
The method outperforms state-of-the-art methods on the BP4D and BP4D+ datasets.
The paper demonstrates effective multi-modal feature learning and fusion.
The paper analyzes how each modality contributes to AU detection performance.
Abstract
Multi-modal learning has intensified in recent years, especially for applications in facial analysis and action unit (AU) detection, yet two main challenges remain: 1) learning relevant features for representation and 2) efficiently fusing multiple modalities. A number of recent works have shown the effectiveness of the attention mechanism for AU detection; however, most of them bind the region of interest (ROI) to features but rarely apply attention between the features of each AU. Meanwhile, the transformer, which uses a more efficient self-attention mechanism, has been widely used in natural language processing and computer vision tasks but has not been fully explored for AU detection. In this paper, we propose a novel end-to-end Multi-Head Fused Transformer (MFT) method for AU detection, which learns AU encoding features…
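The fusion idea described in the abstract can be illustrated with a minimal sketch of multi-head attention across two modalities, where queries come from one modality and keys and values from the other. This is not the authors' MFT implementation: the function names, the toy inputs, and the choice to form heads by slicing the feature dimension (with no learned projection matrices) are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention; each query row attends over all key rows."""
    scale = math.sqrt(len(keys[0]))
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


def multi_head_fuse(feats_a, feats_b, num_heads):
    """Fuse two modalities: queries from feats_a, keys and values from feats_b.

    Assumption: heads are formed by slicing the feature dimension; the learned
    projection matrices of a real transformer are omitted for brevity.
    """
    dim = len(feats_a[0])
    if dim % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    head_dim = dim // num_heads
    fused = [[] for _ in feats_a]
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        q = [row[lo:hi] for row in feats_a]
        kv = [row[lo:hi] for row in feats_b]
        head_out = attend(q, kv, kv)
        for row, part in zip(fused, head_out):
            row.extend(part)  # concatenate head outputs along the feature axis
    return fused


# Hypothetical toy input: 3 AU tokens per modality, 4 features each.
modality_a = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2], [0.3, 0.3, 0.3, 0.3]]
modality_b = [[0.4, 0.3, 0.2, 0.1], [0.0, 0.2, 0.5, 0.1], [0.2, 0.2, 0.2, 0.2]]
fused = multi_head_fuse(modality_a, modality_b, num_heads=2)
```

Each output row is a weighted average of the second modality's rows, computed separately per head, so the fused features have the same shape as the first modality's input. A trained model would add learned query, key, and value projections, residual connections, and feed-forward layers on top of this core step.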
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Residual Connection · Position-Wise Feed-Forward Layer · Dense Connections · Softmax · Label Smoothing · Dropout
