EmoCLIP: A Vision-Language Method for Zero-Shot Video Facial Expression Recognition
Niki Maria Foteinopoulou, Ioannis Patras

TL;DR
EmoCLIP introduces a vision-language model that leverages sample-level text descriptions for zero-shot video facial expression recognition, significantly improving performance over existing methods and aiding mental health assessment.
Contribution
The paper presents a novel zero-shot FER approach using sample-level text supervision, enhancing latent representations and extending applications to mental health symptom estimation.
Findings
Outperforms CLIP by over 10% in weighted average recall
Achieves Pearson's r up to 0.85 in schizophrenia symptom estimation
Demonstrates strong agreement with human experts
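The two metrics named in these findings can be illustrated with a minimal sketch. It assumes the usual definitions: weighted average recall is per-class recall weighted by class frequency (equivalent to overall accuracy), and Pearson's r is the linear correlation between predicted and reference scores. The function names and toy labels below are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def weighted_average_recall(y_true, y_pred):
    """Per-class recall weighted by each class's share of the samples."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    return sum((hits[c] / support[c]) * (support[c] / total) for c in support)

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

# Toy data (hypothetical): three "happy" clips, one "sad" clip.
y_true = ["happy", "happy", "happy", "sad"]
y_pred = ["happy", "happy", "sad", "sad"]
war = weighted_average_recall(y_true, y_pred)

# Toy symptom scores (hypothetical): model estimates vs. reference ratings.
r = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Because the weights are class frequencies, this metric rewards getting the majority classes right; an unweighted average over classes would treat every class equally.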
Abstract
Facial Expression Recognition (FER) is a crucial task in affective computing, but its conventional focus on the seven basic emotions limits its applicability to the complex and expanding emotional spectrum. To address the issue of new and unseen emotions present in dynamic in-the-wild FER, we propose a novel vision-language model that utilises sample-level text descriptions (i.e. captions of the context, expressions or emotional cues) as natural language supervision, aiming to enhance the learning of rich latent representations, for zero-shot classification. To test this, we evaluate using zero-shot classification of the model trained on sample-level descriptions on four popular dynamic FER datasets. Our findings show that this approach yields significant improvements when compared to baseline methods. Specifically, for zero-shot video FER, we outperform CLIP by over 10% in terms of…
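As a rough illustration of zero-shot classification in a CLIP-style model, the sketch below picks the class whose text-description embedding is most similar (by cosine similarity) to a video embedding. It is a minimal sketch under stated assumptions, not the authors' implementation: the encoders are omitted, and the three-dimensional embeddings, class names, and the function `zero_shot_classify` are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(video_embedding, class_text_embeddings):
    """Return the best-matching class label and all similarity scores."""
    scores = {
        label: cosine(video_embedding, text_embedding)
        for label, text_embedding in class_text_embeddings.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical embeddings standing in for text-encoder outputs of
# class-level descriptions; a real model would produce these.
class_text_embeddings = {
    "happiness": [0.9, 0.1, 0.0],
    "sadness": [0.1, 0.9, 0.0],
    "surprise": [0.0, 0.2, 0.9],
}

# Hypothetical embedding standing in for the video-encoder output.
video_embedding = [0.8, 0.3, 0.1]

predicted, scores = zero_shot_classify(video_embedding, class_text_embeddings)
```

Since classes are represented only by text embeddings, an unseen emotion can be added at inference time by supplying a description for it, with no retraining of the classifier head.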
Taxonomy
Topics: Emotion and Mood Recognition · Mental Health via Writing · Sentiment Analysis and Opinion Mining
Methods: Contrastive Language-Image Pre-training · Focus
