FineCLIPER: Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs

Haodong Chen; Haojian Huang; Junhao Dong; Mingzhe Zheng; Dian Shao

arXiv:2407.02157 · cs.CV · June 25, 2025

Open Access

TL;DR

FineCLIPER introduces a multi-modal, hierarchical approach that uses textual descriptions, facial cues, and large language models to improve dynamic facial expression recognition, achieving state-of-the-art results with parameter-efficient fine-tuning.

Contribution

The paper presents a novel hierarchical multi-modal framework that combines textual descriptions with parameter-efficient fine-tuning (PEFT) for dynamic facial expression recognition (DFER), surpassing previous methods in both accuracy and efficiency.

Findings

01 Achieves state-of-the-art performance on DFEW, FERV39k, and MAFW datasets.

02 Effective zero-shot recognition with minimal parameter tuning.

03 Hierarchical multi-modal design enhances facial expression discrimination.

Abstract

Dynamic Facial Expression Recognition (DFER) is crucial for understanding human behavior. However, current methods exhibit limited performance mainly due to the scarcity of high-quality data, the insufficient utilization of facial dynamics, and the ambiguity of expression semantics, etc. To this end, we propose a novel framework, named Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs (FineCLIPER), incorporating the following novel designs: 1) To better distinguish between similar facial expressions, we extend the class labels to textual descriptions from both positive and negative aspects, and obtain supervision by calculating the cross-modal similarity based on the CLIP model; 2) Our FineCLIPER adopts a hierarchical manner to effectively mine useful cues from DFE videos. Specifically, besides directly embedding video frames as input (low semantic…
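The first design in the abstract, comparing a video against positive and negative textual descriptions of each class via cross-modal similarity, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy 3-dimensional embeddings stand in for CLIP features, the function names `cosine` and `classify` are hypothetical, and the "positive minus negative similarity" margin is just one simple way to combine the two descriptions. The abstract does not state how FineCLIPER actually combines them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(video_emb, pos_text_embs, neg_text_embs):
    """Pick the class whose positive description matches the video best
    relative to its negative description.

    The margin rule (pos_sim - neg_sim) is an illustrative assumption,
    not the paper's stated objective.
    """
    margins = {}
    for label in pos_text_embs:
        pos_sim = cosine(video_emb, pos_text_embs[label])
        neg_sim = cosine(video_emb, neg_text_embs[label])
        margins[label] = pos_sim - neg_sim
    best = max(margins, key=margins.get)
    return best, margins


# Made-up embeddings standing in for CLIP text and video features.
pos = {"happiness": [1.0, 0.0, 0.0], "surprise": [0.0, 1.0, 0.0]}
neg = {"happiness": [-1.0, 0.0, 0.0], "surprise": [0.0, -1.0, 0.0]}
video = [0.9, 0.1, 0.0]

label, margins = classify(video, pos, neg)
print(label)  # happiness
```

In a real pipeline the vectors would come from a CLIP-style text encoder and video encoder, and the similarities would typically feed a training loss rather than a hard argmax.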

Taxonomy

Topics: Emotion and Mood Recognition

Methods: Contrastive Language-Image Pre-training