Prompting Visual-Language Models for Dynamic Facial Expression Recognition

Zengqun Zhao; Ioannis Patras

arXiv:2308.13382 · cs.CV · November 27, 2024 · 6 cites

TL;DR

This paper introduces DFER-CLIP, a visual-language model that builds on CLIP and uses descriptions generated by large language models for in-the-wild dynamic facial expression recognition (DFER), achieving state-of-the-art results on the DFEW, FERV39k, and MAFW benchmarks.

Contribution

The paper proposes DFER-CLIP, which adds a Transformer-based temporal model on top of the CLIP image encoder and replaces class-name text inputs with descriptions of facial behaviour generated by large language models.

Findings

01. Achieves state-of-the-art results on DFEW, FERV39k, and MAFW benchmarks.

02. Effectively captures temporal facial features using Transformer encoders.

03. Utilizes descriptive textual inputs for improved expression recognition.
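
To illustrate how textual inputs can drive recognition, here is a minimal sketch of CLIP-style matching: a video embedding is compared with one text embedding per expression class by cosine similarity, and a softmax turns the similarities into class probabilities. This is not the authors' code. The function names, the three class names, the temperature value, and the random vectors (which stand in for real CLIP text and video embeddings) are all assumptions made for illustration, and NumPy is used instead of the PyTorch of the official repository.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def classify_by_similarity(video_emb, text_embs, temperature=0.01):
    """Return class probabilities for one video.

    video_emb: (D,) video-level embedding.
    text_embs: (C, D) one embedding per class description.
    temperature: hypothetical softmax temperature (an assumption).
    """
    v = l2_normalize(video_emb)
    t = l2_normalize(text_embs)
    logits = (t @ v) / temperature      # cosine similarity per class, scaled
    logits = logits - logits.max()      # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

rng = np.random.default_rng(0)
class_names = ["happiness", "sadness", "surprise"]  # illustrative subset

# Random stand-ins for text-encoder outputs (assumption: D = 16).
text_embs = rng.normal(size=(len(class_names), 16))

# A fake video embedding placed close to the "surprise" text embedding.
video_emb = text_embs[2] + 0.05 * rng.normal(size=16)

probs = classify_by_similarity(video_emb, text_embs)
predicted = class_names[int(np.argmax(probs))]
```

In a real pipeline the rows of `text_embs` would come from encoding one description per class with a text encoder; richer descriptions change only those rows, not the matching step.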

Abstract

This paper presents a novel visual-language model called DFER-CLIP, which is based on the CLIP model and designed for in-the-wild Dynamic Facial Expression Recognition (DFER). Specifically, the proposed DFER-CLIP consists of a visual part and a textual part. For the visual part, based on the CLIP image encoder, a temporal model consisting of several Transformer encoders is introduced for extracting temporal facial expression features, and the final feature embedding is obtained as a learnable "class" token. For the textual part, we use as inputs textual descriptions of the facial behaviour that is related to the classes (facial expressions) that we are interested in recognising; those descriptions are generated using large language models, such as ChatGPT. This, in contrast to works that use only the class names, more accurately captures the relationship between them. Alongside the…
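
The visual part described in the abstract can be sketched roughly as follows: per-frame features are prefixed with a "class" token, passed through a stack of self-attention layers, and the output at the class-token position is taken as the video embedding. This is a simplified illustration under stated assumptions, not the authors' implementation. Each layer here is a single-head attention with a residual connection only (no layer normalisation or feed-forward block), the weights are random rather than learned, the random frame features stand in for CLIP image-encoder outputs, and all names and sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (N, D) sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def temporal_class_token(frame_feats, cls_token, pos_emb, layers):
    """Aggregate (T, D) frame features into one (D,) video embedding.

    The class token is prepended, positional embeddings are added, and the
    class-token output after the last layer is returned.
    """
    tokens = np.concatenate([cls_token[None, :], frame_feats], axis=0) + pos_emb
    for w_q, w_k, w_v in layers:
        tokens = tokens + self_attention(tokens, w_q, w_k, w_v)  # residual
    return tokens[0]

rng = np.random.default_rng(0)
T, D = 8, 16                                  # hypothetical: 8 frames, 16-dim features
frame_feats = rng.normal(size=(T, D))         # stand-in for per-frame image features
cls_token = rng.normal(size=D)                # would be learned during training
pos_emb = 0.02 * rng.normal(size=(T + 1, D))  # would be learned during training
layers = [
    tuple(rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))
    for _ in range(2)                         # two attention layers
]

video_emb = temporal_class_token(frame_feats, cls_token, pos_emb, layers)
```

Because the class token attends to every frame, `video_emb` depends on the whole clip; it is this single vector that would then be compared against the text embeddings of the class descriptions.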

Code & Models

Repositories

zengqunzhao/dfer-clip
PyTorch · Official

Taxonomy

Topics: Advanced Computing and Algorithms · Emotion and Mood Recognition · Gaze Tracking and Assistive Technology

Methods: Attention Is All You Need · Linear Layer · Dropout · Byte Pair Encoding · Adam · Position-Wise Feed-Forward Layer · Multi-Head Attention · Absolute Position Encodings · Residual Connection · Label Smoothing