FineCLIPER: Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs
Haodong Chen, Haojian Huang, Junhao Dong, Mingzhe Zheng, Dian Shao

TL;DR
FineCLIPER is a multi-modal, hierarchical framework that combines textual descriptions, facial cues, and large language models to improve dynamic facial expression recognition (DFER), reaching state-of-the-art results with parameter-efficient tuning.
Contribution
The paper presents a hierarchical multi-modal framework that pairs textual descriptions with parameter-efficient fine-tuning (PEFT) for DFER, surpassing previous methods in both accuracy and efficiency.
Findings
Achieves state-of-the-art performance on DFEW, FERV39k, and MAFW datasets.
Effective zero-shot recognition with minimal parameter tuning.
Hierarchical multi-modal design enhances facial expression discrimination.
Abstract
Dynamic Facial Expression Recognition (DFER) is crucial for understanding human behavior. However, current methods exhibit limited performance, mainly due to the scarcity of high-quality data, insufficient use of facial dynamics, and the ambiguity of expression semantics. To this end, we propose a novel framework, named Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs (FineCLIPER), incorporating the following novel designs: 1) to better distinguish between similar facial expressions, we extend the class labels to textual descriptions from both positive and negative aspects, and obtain supervision by calculating the cross-modal similarity based on the CLIP model; 2) FineCLIPER adopts a hierarchical approach to effectively mine useful cues from DFE videos. Specifically, besides directly embedding video frames as input (low semantic…
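The first design in the abstract, supervision from cross-modal similarity against positive and negative class descriptions, can be illustrated with a small sketch. This is not the authors' implementation: the function name `pos_neg_scores`, the temperature value, the toy embeddings, and in particular the choice to subtract the negative similarity from the positive one are all assumptions made for illustration. In practice the embeddings would come from CLIP's video and text encoders rather than from a random generator.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def pos_neg_scores(video_emb, pos_text_emb, neg_text_emb, temperature=0.07):
    """Score videos against positive and negative class descriptions.

    video_emb:    (B, D) one embedding per video clip
    pos_text_emb: (C, D) one embedding per positive class description
    neg_text_emb: (C, D) one embedding per negative class description

    ASSUMPTION: combining the two views as (sim_pos - sim_neg) is only one
    plausible choice; the abstract does not specify the exact objective.
    """
    v = l2_normalize(video_emb)
    p = l2_normalize(pos_text_emb)
    n = l2_normalize(neg_text_emb)

    sim_pos = v @ p.T  # (B, C) cosine similarity to positive descriptions
    sim_neg = v @ n.T  # (B, C) cosine similarity to negative descriptions

    # Softmax over classes, shifted by the row maximum for numerical stability.
    logits = (sim_pos - sim_neg) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return sim_pos, sim_neg, probs


# Toy data standing in for encoder outputs (hypothetical sizes and values).
rng = np.random.default_rng(0)
num_classes, dim = 7, 16
pos_text = rng.normal(size=(num_classes, dim))
neg_text = -pos_text  # toy stand-in for "not this expression" descriptions
video = pos_text[[2, 5]] + 0.01 * rng.normal(size=(2, dim))

sim_pos, sim_neg, probs = pos_neg_scores(video, pos_text, neg_text)
predicted = probs.argmax(axis=1)
print(predicted)
```

In this toy setup each video embedding is a lightly perturbed copy of one class's positive description, so the highest-probability class for each clip is the class it was built from. A training loop would typically apply a cross-entropy loss to `probs` against the ground-truth labels.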
Taxonomy
Topics: Emotion and Mood Recognition
Methods: Contrastive Language-Image Pre-training
