Prompting Visual-Language Models for Dynamic Facial Expression Recognition
Zengqun Zhao, Ioannis Patras

TL;DR
This paper introduces DFER-CLIP, a visual-language model that builds on CLIP and uses descriptions generated by large language models to improve dynamic facial expression recognition in the wild, achieving state-of-the-art results on the DFEW, FERV39k, and MAFW benchmarks.
Contribution
The paper proposes DFER-CLIP, integrating temporal modeling and textual descriptions generated by large language models to enhance facial expression recognition.
Findings
Achieves state-of-the-art results on DFEW, FERV39k, and MAFW benchmarks.
Captures temporal facial expression features with Transformer encoders placed on top of the CLIP image encoder.
Uses LLM-generated textual descriptions of facial behaviour, rather than class names alone, as text inputs.
Abstract
This paper presents a novel visual-language model called DFER-CLIP, which is based on the CLIP model and designed for in-the-wild Dynamic Facial Expression Recognition (DFER). Specifically, the proposed DFER-CLIP consists of a visual part and a textual part. For the visual part, based on the CLIP image encoder, a temporal model consisting of several Transformer encoders is introduced for extracting temporal facial expression features, and the final feature embedding is obtained as a learnable "class" token. For the textual part, we use as inputs textual descriptions of the facial behaviour that is related to the classes (facial expressions) that we are interested in recognising; those descriptions are generated using large language models, like ChatGPT. This contrasts with works that use only the class names, and more accurately captures the relationship between them. Alongside the…
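The two-part design described in the abstract can be illustrated with a toy, standard-library-only sketch. It is not the authors' implementation: the function names (`class_token_readout`, `classify`), the hand-written 3-dimensional vectors, and the temperature value are all assumptions made for illustration. A single attention step with identity projections stands in for the "several Transformer encoders", and fixed vectors stand in for the CLIP image and text embeddings. The final step follows the usual CLIP recipe of comparing embeddings by cosine similarity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def class_token_readout(frame_features, class_token):
    """Toy stand-in for the temporal model (assumption: one attention step,
    identity projections). The class token attends over itself and every
    frame feature; the attended result is the video embedding."""
    tokens = [class_token] + frame_features
    scale = math.sqrt(len(class_token))
    weights = softmax([dot(class_token, t) / scale for t in tokens])
    dim = len(class_token)
    return [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]


def classify(video_embedding, description_embeddings, temperature=0.01):
    """CLIP-style matching: cosine similarity between the video embedding and
    one text embedding per class, turned into probabilities with a softmax.
    The temperature is an illustrative value, not taken from the paper."""
    v = l2_normalize(video_embedding)
    names = list(description_embeddings)
    sims = [dot(v, l2_normalize(description_embeddings[n])) for n in names]
    probs = softmax([s / temperature for s in sims])
    return dict(zip(names, probs))


# Hypothetical per-frame features (in the paper these come from the CLIP image encoder).
frame_features = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.8, 0.2, 0.0]]
# Hypothetical class token (learnable in the paper, fixed here).
class_token = [0.1, 0.1, 0.1]
# Hypothetical text embeddings, one per class description
# (in the paper the descriptions are LLM-generated and encoded by the text encoder).
description_embeddings = {
    "happiness": [1.0, 0.1, 0.0],
    "sadness": [0.0, 1.0, 0.0],
    "surprise": [0.0, 0.0, 1.0],
}

video_embedding = class_token_readout(frame_features, class_token)
probs = classify(video_embedding, description_embeddings)
predicted = max(probs, key=probs.get)
print(predicted, probs)
```

In this sketch the class token plays the role the abstract assigns to it: it is the single vector read out after the temporal step and compared against the text side. A real model would learn the token, the attention projections, and the text prompts jointly.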
Taxonomy
Topics: Advanced Computing and Algorithms · Emotion and Mood Recognition · Gaze Tracking and Assistive Technology
Methods: Attention Is All You Need · Linear Layer · Dropout · Byte Pair Encoding · Adam · Position-Wise Feed-Forward Layer · Multi-Head Attention · Absolute Position Encodings · Residual Connection · Label Smoothing
