MVT: Mask Vision Transformer for Facial Expression Recognition in the wild
Hanting Li, Mingzhe Sui, Feng Zhao, Zhengjun Zha, and Feng Wu

TL;DR
This paper introduces MVT, a pure transformer-based model for facial expression recognition in challenging wild conditions, utilizing mask generation and label rectification to improve accuracy.
Contribution
The paper proposes a novel Mask Vision Transformer (MVT) with a mask generation network and dynamic relabeling, enhancing FER performance in complex real-world scenarios.
Findings
Outperforms prior state-of-the-art methods on RAF-DB with 88.62% accuracy
Achieves 89.22% accuracy on FERPlus, surpassing previous methods
Attains 64.57% accuracy on AffectNet-7 and 61.40% on AffectNet-8
Abstract
Facial Expression Recognition (FER) in the wild is an extremely challenging task in computer vision due to varying backgrounds, low-quality facial images, and the subjectivity of annotators. These uncertainties make it difficult for neural networks to learn robust features on limited-scale datasets. Moreover, the networks can be easily distracted by the above factors and make incorrect decisions. Recently, the vision transformer (ViT) and data-efficient image transformers (DeiT) have shown strong performance on traditional classification tasks. The self-attention mechanism gives transformers a global receptive field in the first layer, which dramatically enhances the feature extraction capability. In this work, we first propose a novel pure transformer-based mask vision transformer (MVT) for FER in the wild, which consists of two modules: a transformer-based mask…
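Illustration
The abstract's point about a global receptive field in the first layer can be illustrated with a minimal sketch of single-head scaled dot-product self-attention. This is not the MVT implementation: the patch embeddings are made-up values, and the query, key, and value projections are assumed to be the identity, whereas a real ViT learns them. The sketch only shows that after one attention layer every patch has a nonzero weight on every other patch.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: the query, key, and value projections are the identity,
    so Q = K = V = tokens. A real ViT learns these projections.
    Returns (outputs, weights).
    """
    d = len(tokens[0])
    scale = math.sqrt(d)
    weights = []
    for q in tokens:
        # Similarity of this query to every key, including itself.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in tokens]
        weights.append(softmax(scores))
    # Each output is a weighted average of all value vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, tokens)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights


# Four hypothetical patch embeddings (e.g. a 2x2 grid of image patches).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outputs, weights = self_attention(patches)

# Every attention weight is strictly positive, so each patch draws on
# all patches after a single layer.
all_positive = all(w > 0.0 for row in weights for w in row)
```

By contrast, a small convolution kernel in the first layer only mixes neighbouring pixels, so a comparable global view requires stacking many layers.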
Taxonomy
Topics: Emotion and Mood Recognition · Advanced Computing and Algorithms · Machine Learning and ELM
Methods: Attention Is All You Need · Linear Layer · Residual Connection · Layer Normalization · Dense Connections · Softmax · Multi-Head Attention · Vision Transformer
