TL;DR
LaCoVL-FER is a novel facial expression recognition framework that combines landmark-guided geometric features with vision-language models to improve robustness and accuracy in complex real-world scenarios.
Contribution
It introduces a landmark-guided adaptive encoder and a vision-language enhancement strategy to effectively fuse geometric and semantic priors for FER.
Findings
Outperforms state-of-the-art methods on the RAF-DB, FERPlus, and AffectNet datasets.
Effectively focuses on key facial regions and suppresses noise.
Enhances generalization and robustness of FER models.
Abstract
Facial Expression Recognition (FER) in the wild remains challenging due to uncontrolled variations in pose, occlusion, and illumination. Most existing attention-based methods rely primarily on visual appearance cues and suffer from attention redundancy and instability, which limits their performance in complex scenarios. To address these issues, we propose a novel landmark-guided contrastive learning network with vision-language enhancement for FER (LaCoVL-FER), which integrates geometric priors from facial landmarks and semantic priors from a vision-language model. Specifically, a Landmark-Guided Adaptive Encoder (LGAE) is designed to introduce geometric priors through a Bi-branch Gated Cross Attention (BGCA) mechanism, which achieves adaptive fusion of landmark-based geometric and visual appearance features to produce expression-relevant features, thereby focusing on key facial…
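To make the idea of gated, two-branch cross attention between landmark and appearance features more concrete, here is a minimal NumPy sketch. It is not the paper's BGCA implementation: the abstract does not specify the architecture, so the single-head attention, the mean-pooling of the landmark branch, the sigmoid gate, and every name (`cross_attention`, `gated_cross_fusion`, `w_gate`, `b_gate`) and tensor shape below are illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Single-head scaled dot-product attention: queries attend to context.

    queries: (Nq, d), context: (Nc, d) -> output: (Nq, d)
    """
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context


def gated_cross_fusion(visual, landmark, w_gate, b_gate):
    """Hypothetical two-branch gated fusion (an assumption, not the paper's BGCA).

    visual:   (Nv, d) appearance tokens
    landmark: (Nl, d) landmark-based geometric tokens
    w_gate:   (2d, d) gate weights, b_gate: (d,) gate bias
    """
    # Branch 1: visual tokens query the landmark tokens.
    vis_from_lmk = cross_attention(visual, landmark)            # (Nv, d)
    # Branch 2: landmark tokens query the visual tokens.
    lmk_from_vis = cross_attention(landmark, visual)            # (Nl, d)
    # Pool branch 2 to one vector so it can be broadcast over visual tokens.
    lmk_summary = lmk_from_vis.mean(axis=0, keepdims=True)      # (1, d)
    geometric = vis_from_lmk + lmk_summary                      # (Nv, d)
    # A sigmoid gate decides, per token and channel, how much geometry to admit.
    gate_in = np.concatenate([visual, geometric], axis=-1)      # (Nv, 2d)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ w_gate + b_gate)))   # (Nv, d)
    return gate * geometric + (1.0 - gate) * visual


# Example with made-up sizes: 49 visual tokens, 68 landmark tokens, d = 16.
rng = np.random.default_rng(0)
visual = rng.normal(size=(49, 16))
landmark = rng.normal(size=(68, 16))
w_gate = rng.normal(size=(32, 16)) * 0.1
b_gate = np.zeros(16)
fused = gated_cross_fusion(visual, landmark, w_gate, b_gate)
```

Because the gate lies strictly between 0 and 1, each fused value is a convex combination of the original visual feature and the geometry-informed feature. In this sketch, that is how a gate could suppress unreliable geometric cues (for example under occlusion) while still admitting them where they help.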
