Video Joint-Embedding Predictive Architectures for Facial Expression Recognition

Lennart Eing; Cristina Luna-Jiménez; Silvan Mertes; Elisabeth André

arXiv:2601.09524·cs.CV·January 15, 2026

Open Access

TL;DR

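This paper applies Video Joint-Embedding Predictive Architectures (V-JEPAs), which are pre-trained by predicting embeddings rather than reconstructing pixels, to facial expression recognition (FER). Shallow classifiers trained on a pre-trained V-JEPA video encoder achieve state-of-the-art performance on RAVDESS, outperform all other vision-based methods on CREMA-D (+1.48 WAR), and generalize well across datasets.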

Contribution

Applies V-JEPAs to FER for the first time, showing that embedding-prediction pre-training, as an alternative to pixel-level reconstruction, yields a video encoder on which shallow classifiers reach state-of-the-art results.

Findings

01 Achieves state-of-the-art performance on the RAVDESS dataset.

02 Outperforms all other vision-based methods on CREMA-D (+1.48 WAR).

03 Demonstrates strong cross-dataset generalization.
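
The CREMA-D finding above is reported in WAR. The listing does not expand the abbreviation; the sketch below assumes it means weighted average recall, as is common in FER work, and contrasts it with unweighted average recall (UAR). The function names and the toy labels are illustrative and do not come from the paper.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Return ({class: recall}, {class: support}) for the classes in y_true."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recall = {c: hits[c] / n for c, n in support.items()}
    return recall, support


def weighted_average_recall(y_true, y_pred):
    """Per-class recall weighted by class support (equals overall accuracy)."""
    recall, support = per_class_recall(y_true, y_pred)
    total = sum(support.values())
    return sum(recall[c] * support[c] for c in recall) / total


def unweighted_average_recall(y_true, y_pred):
    """Plain mean of per-class recall; every class counts equally."""
    recall, _ = per_class_recall(y_true, y_pred)
    return sum(recall.values()) / len(recall)


if __name__ == "__main__":
    # Toy labels, invented for illustration only.
    y_true = ["happy", "happy", "happy", "sad"]
    y_pred = ["happy", "happy", "sad", "sad"]
    print(weighted_average_recall(y_true, y_pred))
    print(unweighted_average_recall(y_true, y_pred))
```

Under this reading, WAR favors frequent classes while UAR treats all expression classes equally, so the two can diverge on imbalanced datasets.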

Abstract

This paper introduces a novel application of Video Joint-Embedding Predictive Architectures (V-JEPAs) for Facial Expression Recognition (FER). Departing from conventional pre-training methods for video understanding that rely on pixel-level reconstructions, V-JEPAs learn by predicting embeddings of masked regions from the embeddings of unmasked regions. This allows the trained encoder to ignore information that is irrelevant to a given video, such as the color of a region of background pixels. Using a pre-trained V-JEPA video encoder, we train shallow classifiers on the RAVDESS and CREMA-D datasets, achieving state-of-the-art performance on RAVDESS and outperforming all other vision-based methods on CREMA-D (+1.48 WAR). Furthermore, cross-dataset evaluations reveal strong generalization capabilities, demonstrating the potential of purely embedding-based pre-training approaches to…
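
The embedding-prediction objective described in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the per-token linear "encoders", the mean pooling over visible tokens, the L1 distance, the momentum-averaged target encoder, and all names (`jepa_loss`, `ema_update`, `W_ctx`, `W_tgt`, `W_pred`, `pos`). It is not the authors' implementation and stands in for what would be transformer networks in practice.

```python
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def random_matrix(rows, cols, rng):
    """Small random weight matrix for the toy linear layers."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def jepa_loss(tokens, masked, W_ctx, W_tgt, W_pred, pos):
    """Mean L1 distance between predicted and target embeddings of masked tokens.

    tokens: list of patch vectors; masked: set of masked token indices;
    pos: one positional vector per token (embedding dimension).
    """
    visible = [i for i in range(len(tokens)) if i not in masked]
    # The context encoder only ever sees the unmasked tokens.
    ctx = [matvec(W_ctx, tokens[i]) for i in visible]
    dim = len(ctx[0])
    # Mean pooling is a crude stand-in for attention over the context.
    pooled = [sum(v[d] for v in ctx) / len(ctx) for d in range(dim)]
    loss = 0.0
    for i in sorted(masked):
        # The predictor is told which position to predict via pos[i].
        query = [p + q for p, q in zip(pooled, pos[i])]
        pred = matvec(W_pred, query)
        # The target lives in embedding space, not pixel space; during
        # training it would be treated as a constant (no gradient).
        target = matvec(W_tgt, tokens[i])
        loss += sum(abs(a - b) for a, b in zip(pred, target)) / dim
    return loss / len(masked)


def ema_update(W_tgt, W_ctx, momentum=0.99):
    """Move the target encoder weights slowly toward the context encoder."""
    return [[momentum * t + (1.0 - momentum) * c for t, c in zip(rt, rc)]
            for rt, rc in zip(W_tgt, W_ctx)]


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
    pos = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(6)]
    W_ctx = random_matrix(3, 4, rng)
    W_tgt = [row[:] for row in W_ctx]
    W_pred = random_matrix(3, 3, rng)
    print(jepa_loss(tokens, {1, 4}, W_ctx, W_tgt, W_pred, pos))
```

Because the loss compares embeddings rather than pixels, the encoder is not penalized for discarding details it cannot or need not predict, which is the property the abstract highlights.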

Taxonomy

Topics: Emotion and Mood Recognition · Human Pose and Action Recognition · Face Recognition and Analysis