MIMIC: Mask Image Pre-training with Mix Contrastive Fine-tuning for Facial Expression Recognition
Fan Zhang, Xiaobao Guo, Xiaojiang Peng, Alex Kot

TL;DR
This paper introduces MIMIC, a novel self-supervised pre-training and contrastive fine-tuning framework for facial expression recognition using vision Transformers, reducing reliance on large face datasets and improving performance.
Contribution
The paper proposes a new FER training paradigm combining masked image pre-training with mix contrastive fine-tuning, effectively mitigating domain disparity and enhancing representation learning.
Findings
MIMIC outperforms previous training paradigms on benchmark datasets.
Vanilla ViT achieves strong results without complex modules.
Scaling up model size continues to improve performance without saturation.
Abstract
Cutting-edge research in facial expression recognition (FER) currently favors convolutional neural network (CNN) backbones that are pre-trained with supervision on face recognition datasets for feature extraction. However, because face recognition datasets are vast and facial labels are costly to collect, this pre-training paradigm is expensive. To this end, we propose to pre-train vision Transformers (ViTs) through a self-supervised approach on a mid-scale general image dataset. In addition, the divergence between general datasets and FER datasets is more pronounced than the domain disparity between face datasets and FER datasets. Therefore, we propose a contrastive fine-tuning approach to effectively mitigate this domain disparity. Specifically, we introduce a novel FER training paradigm named Mask…
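Illustrative sketch
The two ingredients named in the abstract, masked image pre-training and contrastive fine-tuning, can be illustrated with a small standard-library Python sketch. It is not the authors' implementation: the abstract is truncated before the method details, so the 75% mask ratio, the 196-patch grid, the temperature of 0.1, and the function names `random_patch_mask` and `supervised_contrastive_loss` are all assumptions, and the loss shown is a generic supervised contrastive loss rather than MIMIC's mix contrastive objective.

```python
import math
import random


def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a list of booleans, True where a patch is hidden from the encoder."""
    num_masked = int(num_patches * mask_ratio)
    masked_ids = set(rng.sample(range(num_patches), num_masked))
    return [i in masked_ids for i in range(num_patches)]


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Generic supervised contrastive loss (assumed form, not the paper's).

    Each sample acts as an anchor; samples sharing its label are positives,
    and all other samples in the batch form the softmax denominator.
    """
    n = len(features)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no positive contributes nothing
        logits = {
            j: cosine(features[i], features[j]) / temperature
            for j in range(n)
            if j != i
        }
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Hypothetical setup: a 14 x 14 patch grid with 75% of patches masked.
rng = random.Random(0)
mask = random_patch_mask(num_patches=196, mask_ratio=0.75, rng=rng)
print(sum(mask))  # 147 patches hidden

# Toy 2-D features: two tight clusters, one per expression label.
toy_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = supervised_contrastive_loss(toy_features, [0, 0, 1, 1])
misaligned = supervised_contrastive_loss(toy_features, [0, 1, 0, 1])
print(aligned < misaligned)  # True
```

In this toy example the loss is low when features that share a label already lie close together and high when they do not, which is the pressure a contrastive fine-tuning stage applies to the pre-trained representation.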
Taxonomy
Topics: Face recognition and analysis · Face and Expression Recognition · Emotion and Mood Recognition
Methods: Contrastive Learning
