SVFAP: Self-supervised Video Facial Affect Perceiver
Licai Sun, Zheng Lian, Kexin Wang, Yu He, Mingyu Xu, Haiyang Sun, Bin Liu, and Jianhua Tao

TL;DR
SVFAP is a self-supervised learning framework for video facial affect analysis that combines masked autoencoding with a novel Transformer architecture. Its pre-training stage requires no labeled data, and the resulting model improves performance across multiple affect recognition tasks.
Contribution
The paper proposes SVFAP, a self-supervised approach with a novel Transformer encoder, enabling effective large-scale pre-training on unlabeled videos for facial affect analysis.
Findings
Outperforms state-of-the-art on nine datasets
Effective in multiple affect recognition tasks
Reduces computational costs with novel Transformer design
Abstract
Video-based facial affect analysis has recently attracted increasing attention owing to its critical role in human-computer interaction. Previous studies mainly focus on developing various deep learning architectures and training them in a fully supervised manner. Although significant progress has been achieved by these supervised methods, the longstanding lack of large-scale high-quality labeled data severely hinders their further improvements. Motivated by the recent success of self-supervised learning in computer vision, this paper introduces a self-supervised approach, termed Self-supervised Video Facial Affect Perceiver (SVFAP), to address the dilemma faced by supervised methods. Specifically, SVFAP leverages masked facial video autoencoding to perform self-supervised pre-training on massive unlabeled facial videos. Considering that large spatiotemporal redundancy exists in facial…
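The masked autoencoding idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the tube-style masking (one spatial pattern shared by all frames), the 90% mask ratio, the token shapes, and the helper names `tube_mask` and `masked_mse` are all assumptions made for illustration, and a zero array stands in for the output of a real decoder.

```python
import numpy as np

def tube_mask(num_frames, num_patches, mask_ratio=0.9, seed=0):
    """Return a boolean mask of shape (num_frames, num_patches); True = hidden.

    Assumption: the same spatial patches are hidden in every frame.
    """
    rng = np.random.default_rng(seed)
    num_masked = int(round(mask_ratio * num_patches))
    hidden = rng.permutation(num_patches)[:num_masked]
    spatial = np.zeros(num_patches, dtype=bool)
    spatial[hidden] = True
    # Repeat the spatial pattern along the time axis.
    return np.tile(spatial, (num_frames, 1))

def masked_mse(target, prediction, mask):
    """Mean squared reconstruction error computed over hidden tokens only."""
    diff = (prediction - target) ** 2
    return float(diff[mask].mean())

# Toy example: 8 frames, 196 patches per frame, 16-dimensional tokens.
T, N, D = 8, 196, 16
tokens = np.random.default_rng(1).normal(size=(T, N, D))
mask = tube_mask(T, N)

# Only the visible tokens would be passed to the encoder.
visible = tokens[~mask]

# Stand-in for a decoder output; a real model would predict the hidden tokens.
prediction = np.zeros_like(tokens)
loss = masked_mse(tokens, prediction, mask)
```

With a high mask ratio, the encoder processes only a small fraction of the tokens, which is one reason masked pre-training on video can be comparatively cheap.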
Taxonomy
Topics: Emotion and Mood Recognition · Face recognition and analysis · Human Pose and Action Recognition
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Pointwise Convolution · Softmax · Label Smoothing · Multi-Head Attention · Adam · Dropout · Absolute Position Encodings
