Video Joint-Embedding Predictive Architectures for Facial Expression Recognition
Lennart Eing, Cristina Luna-Jiménez, Silvan Mertes, Elisabeth André

TL;DR
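This paper applies Video Joint-Embedding Predictive Architectures (V-JEPAs), which are pre-trained by predicting embeddings rather than reconstructing pixels, to facial expression recognition. Shallow classifiers trained on a pre-trained V-JEPA encoder achieve state-of-the-art results on RAVDESS, outperform all other vision-based methods on CREMA-D, and generalize well across datasets.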
Contribution
Applies V-JEPAs to FER for the first time, showing that embedding-prediction pre-training can surpass traditional pixel-level reconstruction methods.
Findings
Achieves state-of-the-art performance on the RAVDESS dataset.
Outperforms all vision-based methods on CREMA-D (+1.48 WAR).
Demonstrates strong cross-dataset generalization.
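The CREMA-D gain above is reported in WAR. Assuming WAR denotes weighted average recall, as is common in the FER literature, the sketch below shows how it could be computed and how it differs from the unweighted average recall (UAR). The function name, the UAR comparison, and the toy labels are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter

def recall_scores(y_true, y_pred):
    """Return (WAR, UAR) for two equally long label sequences."""
    support = Counter(y_true)  # number of samples per true class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    per_class = {c: hits[c] / support[c] for c in support}  # recall per class
    n = len(y_true)
    # WAR weights each class recall by its share of the samples,
    # which makes it equal to overall accuracy.
    war = sum(per_class[c] * support[c] / n for c in support)
    # UAR gives every class the same weight regardless of its size.
    uar = sum(per_class.values()) / len(per_class)
    return war, uar

# Toy labels (hypothetical, not from RAVDESS or CREMA-D).
y_true = ["happy", "happy", "happy", "sad"]
y_pred = ["happy", "happy", "sad", "sad"]
war, uar = recall_scores(y_true, y_pred)
```

Under this definition, a model can raise WAR by favoring frequent classes, which is why imbalanced benchmarks are often reported with both scores.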
Abstract
This paper introduces a novel application of Video Joint-Embedding Predictive Architectures (V-JEPAs) for Facial Expression Recognition (FER). Departing from conventional pre-training methods for video understanding that rely on pixel-level reconstruction, V-JEPAs learn by predicting the embeddings of masked regions from the embeddings of unmasked regions. This allows the trained encoder to ignore information that is irrelevant to a given video, such as the color of a region of background pixels. Using a pre-trained V-JEPA video encoder, we train shallow classifiers on the RAVDESS and CREMA-D datasets, achieving state-of-the-art performance on RAVDESS and outperforming all other vision-based methods on CREMA-D (+1.48 WAR). Furthermore, cross-dataset evaluations reveal strong generalization capabilities, demonstrating the potential of purely embedding-based pre-training approaches to…
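The embedding-prediction objective described in the abstract can be illustrated with a minimal forward-pass sketch. It is not the authors' implementation: the linear "encoders", the pooled-context predictor, the L1 distance, and all shapes are toy assumptions chosen only to show that the loss is computed between embeddings rather than pixels. No gradient step or weight update is shown.

```python
import numpy as np

def jepa_style_loss(tokens, mask, w_ctx, w_tgt, w_pred, pos):
    """Embedding-prediction loss for one video (forward pass only).

    tokens: (N, D_in) patch tokens; mask: (N,) bool, True = masked.
    w_ctx, w_tgt: (D_in, D) toy linear encoders (hypothetical stand-ins
    for the real video encoders).
    w_pred: (D, D) toy linear predictor; pos: (N, D) positional embeddings.
    """
    # The context encoder sees only the unmasked tokens.
    ctx = np.tanh(tokens[~mask] @ w_ctx)             # (N_visible, D)
    # The target encoder sees all tokens; only masked positions are targets.
    tgt = np.tanh(tokens @ w_tgt)[mask]              # (N_masked, D)
    # Toy predictor: pooled context plus the position of each masked token.
    pred = (ctx.mean(axis=0) + pos[mask]) @ w_pred   # (N_masked, D)
    # The distance is measured in embedding space, never in pixel space.
    return float(np.abs(pred - tgt).mean())

rng = np.random.default_rng(0)
N, D_in, D = 16, 12, 8                               # assumed toy sizes
tokens = rng.normal(size=(N, D_in))
mask = np.zeros(N, dtype=bool)
mask[rng.choice(N, size=6, replace=False)] = True    # mask 6 of 16 tokens
w_ctx = rng.normal(size=(D_in, D))
w_tgt = w_ctx.copy()                                 # target starts as a copy
w_pred = rng.normal(size=(D, D))
pos = rng.normal(size=(N, D))
loss = jepa_style_loss(tokens, mask, w_ctx, w_tgt, w_pred, pos)
```

Because the target is an embedding, the objective has no term that rewards reproducing low-level appearance details such as background color, which is the property the abstract highlights.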
Taxonomy
Topics: Emotion and Mood Recognition · Human Pose and Action Recognition · Face recognition and analysis
