Spatiotemporal Contrastive Learning of Facial Expressions in Videos
Shuvendu Roy, Ali Etemad

TL;DR
This paper introduces a self-supervised contrastive learning method for facial expression recognition in videos, utilizing a novel temporal augmentation scheme to improve accuracy and outperform existing methods.
Contribution
It presents a new temporal sampling-based augmentation scheme for contrastive learning in FER, enhancing the effectiveness of self-supervised video-based facial expression recognition.
Findings
Achieved 89.4% accuracy on the Oulu-CASIA dataset.
Set a new state of the art on Oulu-CASIA, outperforming the other FER methods it was compared against.
Temporal augmentation significantly improves recognition performance.
Abstract
We propose a self-supervised contrastive learning approach for facial expression recognition (FER) in videos. We propose a novel temporal sampling-based augmentation scheme to be utilized in addition to standard spatial augmentations used for contrastive learning. Our proposed temporal augmentation scheme randomly picks from one of three temporal sampling techniques: (1) pure random sampling, (2) uniform sampling, and (3) sequential sampling. This is followed by a combination of up to three standard spatial augmentations. We then use a deep R(2+1)D network for FER, which we train in a self-supervised fashion based on the augmentations and subsequently fine-tune. Experiments are performed on the Oulu-CASIA dataset and the performance is compared to other works in FER. The results indicate that our method achieves an accuracy of 89.4%, setting a new state-of-the-art by outperforming other…
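The temporal augmentation described in the abstract can be illustrated with a minimal sketch. The abstract only names the three sampling techniques, so the definitions below are assumptions: "pure random" is taken to mean distinct frames drawn anywhere in the video and kept in temporal order, "uniform" to mean evenly spaced frames, and "sequential" to mean a run of consecutive frames from a random start. The function names and the `num_frames`, `clip_len`, and `rng` parameters are hypothetical and do not come from the paper.

```python
import random


def random_sampling(num_frames, clip_len, rng):
    # Assumed meaning of "pure random sampling": distinct frame indices
    # drawn from anywhere in the video, returned in temporal order.
    return sorted(rng.sample(range(num_frames), clip_len))


def uniform_sampling(num_frames, clip_len, rng):
    # Assumed meaning of "uniform sampling": evenly spaced indices that
    # span the whole video. rng is unused; kept for a common signature.
    if clip_len == 1:
        return [0]
    return [(i * (num_frames - 1)) // (clip_len - 1) for i in range(clip_len)]


def sequential_sampling(num_frames, clip_len, rng):
    # Assumed meaning of "sequential sampling": consecutive frames
    # starting at a randomly chosen offset.
    start = rng.randint(0, num_frames - clip_len)
    return list(range(start, start + clip_len))


SAMPLERS = (random_sampling, uniform_sampling, sequential_sampling)


def temporal_augment(num_frames, clip_len, rng=None):
    # Randomly pick one of the three samplers, as the abstract describes,
    # and return the frame indices that form one augmented clip.
    if clip_len > num_frames:
        raise ValueError("clip_len must not exceed num_frames")
    rng = rng or random.Random()
    sampler = rng.choice(SAMPLERS)
    return sampler(num_frames, clip_len, rng)
```

In a contrastive setup, calling `temporal_augment` twice on the same video would give two sets of frame indices, each of which could then receive spatial augmentations to form a positive pair. How the paper actually pairs views is not stated in the abstract, so this usage is also an assumption.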
Taxonomy
Methods: Contrastive Learning · Residual Connection · Average Pooling · Dense Connections · Global Average Pooling · (2+1)D Convolution · Batch Normalization · R(2+1)D
